Winner team Mia at TextVQA Challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

fields: cs.CV (4)

years: 2023 (1), 2022 (3)

roles: method (1)

polarities: use method (1)

representative citing papers

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

GIT: A Generative Image-to-text Transformer for Vision and Language

cs.CV · 2022-05-27 · unverdicted · novelty 5.0

GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.

PaLI-X: On Scaling up a Multilingual Vision and Language Model

cs.CV · 2023-05-29 · unverdicted · novelty 4.0

Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.

citing papers explorer

  • PaLI: A Jointly-Scaled Multilingual Language-Image Model · cs.CV · 2022-09-14 · conditional · none · ref 57

  • Flamingo: a Visual Language Model for Few-Shot Learning · cs.CV · 2022-04-29 · unverdicted · none · ref 85

  • GIT: A Generative Image-to-text Transformer for Vision and Language · cs.CV · 2022-05-27 · unverdicted · none · ref 23

  • PaLI-X: On Scaling up a Multilingual Vision and Language Model · cs.CV · 2023-05-29 · unverdicted · none · ref 52
