PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Winner team Mia at TextVQA Challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 4roles
method 1polarities
use method 1representative citing papers
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.
Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.
citing papers explorer
-
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
GIT: A Generative Image-to-text Transformer for Vision and Language
GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.
-
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.