CVPR , year=

Visual7w: Grounded question answering in images , author=

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

VisualBERT: A Simple and Performant Baseline for Vision and Language

cs.CV · 2019-08-09 · conditional · novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

citing papers explorer

Showing 2 of 2 citing papers.

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning cs.CV · 2024-12-18 · unverdicted · none · ref 144
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
VisualBERT: A Simple and Performant Baseline for Vision and Language cs.CV · 2019-08-09 · conditional · none · ref 82
VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

CVPR , year=

fields

years

verdicts

representative citing papers

citing papers explorer