Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy, Li Fei-Fei · 2015

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

VisualBERT: A Simple and Performant Baseline for Vision and Language

cs.CV · 2019-08-09 · conditional · novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

PaLI-X: On Scaling up a Multilingual Vision and Language Model

cs.CV · 2023-05-29 · unverdicted · novelty 4.0

Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.

citing papers explorer

VisualBERT: A Simple and Performant Baseline for Vision and Language

cs.CV · 2019-08-09 · conditional · none · ref 16

PaLI-X: On Scaling up a Multilingual Vision and Language Model

cs.CV · 2023-05-29 · unverdicted · none · ref 30
