Unifying vision-and-language tasks via text generation

Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal · 2021 · arXiv 2102.02779

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

cs.CV · 2023-05-11 · conditional · novelty 7.0

Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.

citing papers explorer

Showing 3 of 3 citing papers.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning cs.CV · 2023-05-11 · conditional · none · ref 6
Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models cs.CV · 2023-01-30 · unverdicted · none · ref 3
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning cs.CV · 2026-04-16 · unverdicted · none · ref 9
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.

Unifying vision-and-language tasks via text generation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer