Vlmo: Uniﬁed vision-language pre- training with mixture-of-modality-experts

· 2021 · arXiv 2111.02358

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2 dataset 1

citation-polarity summary

background 2 use dataset 1

representative citing papers

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

cs.CV · 2023-03-08 · accept · novelty 7.0

Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

cs.CL · 2024-11-07 · conditional · novelty 6.0

MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

CoCa: Contrastive Captioners are Image-Text Foundation Models

cs.CV · 2022-05-04 · accept · novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

Recent Advances in Multimodal Affective Computing: An NLP Perspective

cs.CL · 2024-09-11 · unverdicted · novelty 3.0

Survey organizing multimodal affective computing research around four NLP tasks, method paradigms, datasets, evaluation protocols, and future directions while releasing a resource repository.

citing papers explorer

Showing 6 of 6 citing papers.

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models cs.CV · 2023-03-08 · accept · none · ref 3
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models cs.CV · 2023-01-30 · unverdicted · none · ref 11
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 122
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 4
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 30
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Recent Advances in Multimodal Affective Computing: An NLP Perspective cs.CL · 2024-09-11 · unverdicted · none · ref 215
Survey organizing multimodal affective computing research around four NLP tasks, method paradigms, datasets, evaluation protocols, and future directions while releasing a resource repository.

Vlmo: Uniﬁed vision-language pre- training with mixture-of-modality-experts

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer