BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Dongxu Li; Junnan Li; Silvio Savarese; Steven Hoi

arxiv: 2301.12597 · v3 · submitted 2023-01-30 · 💻 cs.CV


Pith reviewed 2026-05-12 00:04 UTC · model grok-4.3


keywords vision-language pre-training · bootstrapping · frozen encoders · Querying Transformer · zero-shot VQA · multimodal models · large language models · image-to-text generation


The pith

BLIP-2 connects frozen image encoders and large language models with a lightweight Querying Transformer to bootstrap efficient vision-language pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the high cost of vision-and-language pre-training can be avoided by bootstrapping from already-trained frozen image encoders and frozen large language models rather than training everything end-to-end. This matters because full joint training of large multimodal models requires prohibitive compute resources. BLIP-2 introduces a small Querying Transformer trained in two stages: the first stage learns visual representations aligned to language from the frozen image encoder, and the second stage learns to generate language outputs from those representations using the frozen language model. If this works, the resulting models deliver strong performance on tasks such as visual question answering and image captioning while training far fewer parameters than prior methods.

Core claim

BLIP-2 is a generic pre-training strategy that freezes a pre-trained image encoder and a pre-trained large language model, then trains only a lightweight Querying Transformer in two stages to bridge the modality gap. The first stage bootstraps vision-language representation learning from the frozen image encoder. The second stage bootstraps vision-to-language generative learning from the frozen language model. This produces models that reach state-of-the-art results on vision-language tasks despite using significantly fewer trainable parameters than existing approaches, such as outperforming Flamingo80B by 8.7 percent on zero-shot VQAv2 with 54 times fewer trainable parameters, and that also show emerging zero-shot image-to-text generation that follows natural language instructions.

What carries the argument

The Querying Transformer (Q-Former), a small transformer module that extracts a fixed set of visual query embeddings from the frozen image encoder and feeds them as input to the frozen language model.
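The mechanism described above can be illustrated with a toy sketch. It is not the authors' implementation: it reduces the Q-Former to a single cross-attention step with no learned attention projections, and the linear projection to the language model width, the names (`cross_attend`, `project`, `visual_prefix`), the sizes, and the random stand-in features are all assumptions made for illustration. The point it shows is that a fixed number of learned queries yields a fixed-size visual prefix regardless of how many patch features the frozen encoder emits.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Each query attends over all image features (single head, simplified)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in image_feats])
        out.append([sum(w * f[d] for w, f in zip(weights, image_feats))
                    for d in range(dim)])
    return out


def project(vectors, weight):
    """Linear map from bridge width to LLM width; weight has shape [llm_dim][dim]."""
    return [[dot(row, v) for row in weight] for v in vectors]


random.seed(0)
DIM, LLM_DIM, NUM_QUERIES = 4, 6, 3  # toy sizes, hypothetical

# The only trainable pieces in this sketch: the queries and the projection.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
proj_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(LLM_DIM)]


def visual_prefix(image_feats):
    return project(cross_attend(queries, image_feats), proj_w)


# Random stand-ins for frozen image-encoder outputs with different patch counts.
img_a = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
img_b = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(9)]
prefix_a = visual_prefix(img_a)
prefix_b = visual_prefix(img_b)

# Stand-in for frozen LLM token embeddings; the prefix is prepended to them.
text_embeds = [[0.0] * LLM_DIM for _ in range(2)]
llm_input = prefix_a + text_embeds
```

Both `prefix_a` and `prefix_b` contain `NUM_QUERIES` vectors of width `LLM_DIM`, so the language model sees the same number of visual positions for every image.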

If this is right

Zero-shot visual question answering performance can exceed that of models with orders of magnitude more trainable parameters.
Zero-shot image-to-text generation becomes possible that follows free-form natural language instructions.
Pre-training compute is limited to the small Querying Transformer rather than the full size of the image encoder or language model.
The same bridging approach works across different choices of frozen image encoders and language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Modality gaps between separately trained models may be bridgeable modularly, reducing the need to retrain large components when new data or tasks appear.
The separation of representation learning and generative alignment into two stages could be applied to connect other frozen models such as audio encoders to language models.
Efficiency gains from bootstrapping suggest that scaling curves for multimodal systems should consider the cost of the bridge module separately from the frozen backbones.

Load-bearing premise

A small Querying Transformer trained on frozen components can learn sufficient alignment between image features and language model inputs without any end-to-end updates to the large frozen models.

What would settle it

An end-to-end fine-tuned version of the same base image encoder and language model on identical pre-training data and compute budget would need to be evaluated on zero-shot VQAv2 to check whether the frozen approach loses accuracy that joint training recovers.


The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BLIP-2 shows a small Q-Former can align frozen vision and language backbones well enough to beat much larger models on zero-shot tasks while training far fewer parameters.


BLIP-2's main point is that you can bootstrap capable vision-language models by training only a 188M-parameter Querying Transformer on top of completely frozen image encoders and LLMs. The two-stage recipe first aligns the Q-Former to the vision encoder with contrastive and matching losses, then uses it to condition the LLM for generative pre-training. This keeps the heavy components untouched and, by the paper's own count, leaves 54 times fewer trainable parameters than Flamingo80B.
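The contrastive part of the first stage can be sketched as a symmetric in-batch cross-entropy over image-text similarities. This is a hedged illustration, not the authors' code: it assumes the image-text score is the maximum dot product between any query output and the text vector, and the temperature value, function names, and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_text_similarity(query_outputs, text_vec):
    # Assumption of this sketch: score an image-text pair by its best-matching query.
    return max(dot(q, text_vec) for q in query_outputs)


def log_softmax_at(logits, index):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def itc_loss(batch_queries, batch_texts, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    n = len(batch_texts)
    sim = [[image_text_similarity(batch_queries[i], batch_texts[j]) / temperature
            for j in range(n)] for i in range(n)]
    image_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    text_to_image = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
                         for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: two images, each summarized by two query outputs, and two texts.
batch_queries = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = itc_loss(batch_queries, aligned_texts)
swapped_loss = itc_loss(batch_queries, swapped_texts)
```

With these toy vectors the correctly paired batch gives a lower loss than the swapped one, which is the signal that would push the trainable bridge toward language-aligned visual features while the image encoder stays frozen.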

Referee Report

2 major / 2 minor

Summary. The paper introduces BLIP-2, a vision-language pre-training approach that bootstraps from off-the-shelf frozen image encoders and frozen large language models by inserting a lightweight Querying Transformer (Q-Former). The Q-Former is trained in two stages: first using image-text contrastive and matching objectives on the frozen vision encoder, then using language modeling objectives on the frozen LLM. The method reports state-of-the-art results across vision-language tasks while using far fewer trainable parameters than prior work; the headline empirical claim is an 8.7% gain over Flamingo-80B on zero-shot VQAv2 with 54× fewer trainable parameters, plus emerging zero-shot image-to-text generation that follows natural-language instructions.

Significance. If the central empirical claims hold, the work demonstrates a practical and computationally efficient route to high-performing vision-language models that avoids end-to-end training of billion-parameter backbones. The two-stage bootstrapping strategy and the parameter-efficiency result are the primary contributions; they directly address the prohibitive cost of full multimodal pre-training and could influence future model design by showing that a modest bridging module can extract usable visual information from frozen encoders. The zero-shot instruction-following capability is an additional positive signal.

major comments (2)

[§3.2–3.3] The two-stage Q-Former training procedure is described in detail, yet no ablation is presented that unfreezes either the image encoder or the LLM (or both) and measures the resulting change in downstream performance. Without this comparison it is impossible to determine whether the reported performance ceiling is limited by the frozen-backbone constraint or whether the Q-Former truly extracts all necessary visual information.
[Abstract] Abstract and experimental claims: The 8.7% zero-shot VQAv2 improvement over Flamingo-80B is presented as a key result, but the visible text provides no table or section that lists the exact training data, evaluation splits, prompt templates, or hyper-parameters used for both models. This information is load-bearing for verifying that the efficiency advantage is not an artifact of mismatched experimental conditions.

minor comments (2)

The notation for the Q-Former queries and the two-stage loss functions should be introduced with explicit equations and a diagram that shows which components are frozen at each stage.
Figure captions and table footnotes should explicitly state the number of trainable parameters for every compared model so that the 54× claim can be checked at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of our experimental design and presentation. We address each major comment below and propose targeted revisions to strengthen the manuscript.


Referee: [§3.2–3.3] The two-stage Q-Former training procedure is described in detail, yet no ablation is presented that unfreezes either the image encoder or the LLM (or both) and measures the resulting change in downstream performance. Without this comparison it is impossible to determine whether the reported performance ceiling is limited by the frozen-backbone constraint or whether the Q-Former truly extracts all necessary visual information.

Authors: We agree that a direct ablation unfreezing the image encoder or LLM would provide valuable additional evidence. However, such experiments would require training models with billions of parameters end-to-end, which is computationally prohibitive and directly contradicts the paper's central goal of demonstrating high performance while keeping the backbones frozen. Our results already show that the lightweight Q-Former can extract sufficient visual information to achieve state-of-the-art zero-shot performance. We will add a new paragraph in Section 4 (or a dedicated limitations subsection) discussing the rationale for the frozen setting, the expected trade-offs of unfreezing, and why we consider the current results sufficient to support our claims. revision: partial
Referee: [Abstract] The 8.7% zero-shot VQAv2 improvement over Flamingo-80B is presented as a key result, but the visible text provides no table or section that lists the exact training data, evaluation splits, prompt templates, or hyper-parameters used for both models. This information is load-bearing for verifying that the efficiency advantage is not an artifact of mismatched experimental conditions.

Authors: We thank the referee for pointing out the need for greater transparency. The training data, evaluation splits, and prompt templates for BLIP-2 are detailed in Sections 4.1, 4.2, and the appendix; the Flamingo-80B numbers are taken directly from the original Flamingo paper using the identical zero-shot VQAv2 protocol. To eliminate any ambiguity, we will insert a new table (or expanded subsection in Section 4) that explicitly tabulates the data sources, splits, prompts, and hyper-parameter settings for both models, along with citations to the Flamingo paper for the comparison numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims are empirical benchmarks against external models


The paper's derivation consists of a two-stage training procedure for the Q-Former on frozen ViT and LLM backbones, with performance evaluated on standard external benchmarks (VQAv2, etc.) and compared to Flamingo-80B. These results do not reduce to any internal fitted parameter or self-defined quantity by construction. No equations or steps in the provided text exhibit self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that replace independent verification. The efficiency claim (54x fewer parameters) is a direct count of trainable parameters, not a derived prediction.
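The parameter-count arithmetic behind the 54x figure can be checked with a short sketch. The 188M figure comes from the desk editor's note on this page; the Flamingo trainable-parameter count and the frozen-backbone sizes below are assumptions that this review does not state, so the numbers should be verified against the two papers before being relied on.

```python
def trainable_params(modules):
    """Sum parameters only over modules that are not frozen."""
    return sum(m["params"] for m in modules if not m["frozen"])


# Frozen-backbone sizes are hypothetical placeholders; only the bridge is trained.
blip2_modules = [
    {"name": "image_encoder", "params": 1_000_000_000, "frozen": True},
    {"name": "q_former", "params": 188_000_000, "frozen": False},
    {"name": "llm", "params": 11_000_000_000, "frozen": True},
]

# Assumption: trainable parameters of the 80B Flamingo model, not given on this page.
FLAMINGO_TRAINABLE = 10_200_000_000

blip2_trainable = trainable_params(blip2_modules)
ratio = FLAMINGO_TRAINABLE / blip2_trainable
print(f"trainable: {blip2_trainable:,}  ratio: {ratio:.1f}x")
```

Under these assumed counts the ratio rounds to 54, which matches the headline claim. The frozen sizes do not affect it, because frozen modules contribute nothing to the trainable total.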

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that separately pre-trained unimodal models already contain sufficiently aligned representations that a small learned interface can exploit.

axioms (1)

domain assumption: Frozen pre-trained image encoders and large language models retain useful cross-modal information that a lightweight interface can extract without further updating the large models.
Invoked to justify keeping both backbones frozen throughout training.

invented entities (1)

Querying Transformer (Q-Former) · no independent evidence
purpose: Lightweight module that queries visual features from the frozen image encoder and conditions the frozen language model.
New architectural component introduced to bridge the two frozen models.

pith-pipeline@v0.9.0 · 5479 in / 1242 out tokens · 50319 ms · 2026-05-12T00:04:15.143007+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
cs.MM 2026-04 unverdicted novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
cs.CV 2026-04 conditional novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
cs.CV 2026-03 unverdicted novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization
cs.GR 2026-01 unverdicted novelty 7.0

LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
SAM 3: Segment Anything with Concepts
cs.CV 2025-11 unverdicted novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
cs.RO 2024-09 conditional novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms ...
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
cs.CV 2024-06 unverdicted novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
LRM: Large Reconstruction Model for Single Image to 3D
cs.CV 2023-11 conditional novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV 2023-10 unverdicted novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
cs.CV 2023-09 unverdicted novelty 7.0

DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
ViperGPT: Visual Inference via Python Execution for Reasoning
cs.CV 2023-03 unverdicted novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
cs.CV 2023-03 accept novelty 7.0

Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
Language Is Not All You Need: Aligning Perception with Language Models
cs.CL 2023-02 conditional novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
cs.CV 2026-05 unverdicted novelty 6.0

StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a ...
Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
cs.CV 2026-05 unverdicted novelty 6.0

MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
cs.CR 2026-05 conditional novelty 6.0

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
eess.IV 2026-04 unverdicted novelty 6.0

Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.
ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
cs.AI 2026-04 unverdicted novelty 6.0

ReactBench benchmark shows MLLMs suffer over 30% performance drop on complex topological reasoning tasks versus basic ones when evaluated on chemical reaction diagrams.
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
cs.CV 2026-04 unverdicted novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
UniRec: Unified Multimodal Encoding for LLM-Based Recommendations
cs.IR 2026-01 unverdicted novelty 6.0

UniRec unifies heterogeneous recommendation modalities via specialized encoders, triplet representations, and hierarchical modeling to outperform prior multimodal LLM recommenders by up to 15% on benchmarks.
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 6.0

Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
A cross-species neural foundation model for end-to-end speech decoding
cs.CL 2025-11 unverdicted novelty 6.0

A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
cs.CV 2025-11 unverdicted novelty 6.0

RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefic...
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?
cs.CV 2025-07 conditional novelty 6.0

The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
cs.CV 2025-07 unverdicted novelty 6.0

Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
cs.CV 2025-03 unverdicted novelty 6.0

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA w...
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
cs.CL 2024-11 unverdicted novelty 6.0

MolReFlect introduces a teacher-student framework that automatically creates fine-grained molecule-text alignments to achieve SOTA results on molecule-caption translation.
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
cs.CV 2024-11 unverdicted novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
cs.CV 2024-10 unverdicted novelty 6.0

Presents ChatSearch dataset and ChatSearcher generative model for conversational image retrieval on open-domain images, claiming superior performance on the new dataset and competitive results elsewhere.
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
cs.CV 2024-05 unverdicted novelty 6.0

SketchDeco performs training-free sketch colourisation via diffusion inversion to insert user colors followed by custom self-attention blending for local fidelity and global harmony.
BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV 2024-04 accept novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
cs.CV 2024-03 conditional novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
cs.CV 2024-02 unverdicted novelty 6.0

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
cs.CV 2024-01 conditional novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
InstantID: Zero-shot Identity-Preserving Generation in Seconds
cs.CV 2024-01 unverdicted novelty 6.0

InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
cs.LG 2023-10 conditional novelty 6.0

LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
cs.CV 2023-08 unverdicted novelty 6.0

DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
cs.CV 2023-08 unverdicted novelty 6.0

IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
cs.CV 2023-07 unverdicted novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 111 Pith papers · 10 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flami...

work page internal anchor Pith review arXiv
[2]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Chen, J., Guo, H., Yi, K., Li, B., and Elhoseiny, M. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, pp. 18009–18019, 2022a. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K...

work page internal anchor Pith review arXiv
[3]

Unifying vision-and-language tasks via text generation

Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779,

work page arXiv
[4]

Scaling Instruction-Finetuned Language Models

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V. Y., Huang, Y., Dai, A. M., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scali...

work page internal anchor Pith review arXiv
[5]

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022.

work page arXiv
[6]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv prepr...

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Scaling up visual and vision-language representation learning with noisy text supervision

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918,

work page arXiv
[8]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Learning Transferable Visual Models From Natural Language Supervision

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114,

work page internal anchor Pith review arXiv
[11]

Vlmo: Unified vision-language pre-training with mixture-of-modality-experts

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), ICML, pp. 23318–23340, 2022a. Wang, W., Bao, H., Dong, L., ...

work page arXiv
[12]

CoCa: Contrastive Captioners are Image-Text Foundation Models

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917,

work page internal anchor Pith review arXiv
[13]

Florence: A New Foundation Model for Computer Vision

Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M., Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B., Xiao, Z., Yang, J., Zeng, M., Zhou, L., and Zhang, P. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432,

work page internal anchor Pith review arXiv
[14]

Vinvl: Making visual representations matter in vision-language models

Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529,

work page arXiv
[15]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M. T., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv
