pith. machine review for the scientific record.

arXiv: 1908.03557 · v1 · submitted 2019-08-09 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

VisualBERT: A Simple and Performant Baseline for Vision and Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords VisualBERT · vision and language · Transformer · self-attention · pre-training · visual grounding · VQA · image captioning

The pith

VisualBERT uses Transformer self-attention to align text tokens with image regions from caption data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisualBERT as a stack of Transformer layers that processes text tokens and image region features together in a single sequence. Self-attention within these layers creates implicit alignments between words and visual elements during both pre-training and downstream use. Two language modeling objectives that incorporate visual information allow pre-training on ordinary image caption datasets. On four standard vision-and-language benchmarks the resulting model matches or exceeds the performance of more elaborate systems while requiring far less custom engineering.

Core claim

VisualBERT is a stack of Transformer layers that takes text tokens and image region features as input and uses self-attention to create implicit alignments between them. Pre-training with visually-grounded language model objectives on image caption data allows it to achieve strong performance on downstream vision-and-language tasks such as VQA, VCR, NLVR2, and image retrieval on Flickr30K, while being simpler than prior approaches.
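
A minimal sketch of this single-stream design, assuming BERT-base dimensions (768 hidden units, 12 layers) and 2048-dimensional detector region features; the class name, hyperparameters, and projection layer are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class VisualBERTSketch(nn.Module):
        """Joint text + region encoder: one Transformer stack over a single sequence."""
        def __init__(self, vocab_size=30522, hidden=768, region_dim=2048,
                     layers=12, heads=12, max_len=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, hidden)
            self.pos_emb = nn.Embedding(max_len, hidden)
            self.seg_emb = nn.Embedding(2, hidden)            # 0 = text, 1 = visual
            self.region_proj = nn.Linear(region_dim, hidden)  # detector features -> hidden size
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, token_ids, region_feats):
            B, T = token_ids.shape
            R = region_feats.shape[1]
            pos = torch.arange(T, device=token_ids.device).unsqueeze(0)
            text = self.tok_emb(token_ids) + self.pos_emb(pos) \
                 + self.seg_emb(torch.zeros_like(token_ids))
            seg_v = torch.ones(B, R, dtype=torch.long, device=token_ids.device)
            visual = self.region_proj(region_feats) + self.seg_emb(seg_v)
            # One joint sequence: self-attention is free to attend across modalities,
            # which is where the implicit word-region alignments are claimed to arise.
            joint = torch.cat([text, visual], dim=1)          # (B, T + R, hidden)
            return self.encoder(joint)

    model = VisualBERTSketch()
    out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
    print(out.shape)  # torch.Size([2, 52, 768])

VisualBERT itself starts from pre-trained BERT weights on the text side; the sketch trains everything from scratch and omits that detail for brevity.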

What carries the argument

The stack of Transformer layers whose self-attention implicitly aligns text tokens with image regions.

If this is right

  • A single architecture and pre-training recipe suffices for multiple vision-and-language tasks without task-specific modules.
  • Grounding of language elements to image regions emerges without explicit supervision.
  • The model tracks syntactic relationships such as verb-argument associations in images.
  • Model complexity can be reduced while retaining competitive accuracy on visual reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same implicit-alignment approach could be tested on video-text or audio-text pairs to check generality across modalities.
  • Scaling the caption pre-training data further might improve robustness on tasks requiring fine spatial reasoning.
  • The observed syntactic sensitivity opens the possibility of using the model for visual parsing or scene-graph generation without additional labels.

Load-bearing premise

Self-attention inside the Transformer layers can learn useful alignments between text tokens and image regions from caption data alone without any explicit grounding supervision.

What would settle it

A controlled test in which VisualBERT fails to associate verbs with the image regions that depict their arguments would falsify the claim of implicit syntactic grounding.
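
One way such a test could be instrumented, stated as an assumption rather than the paper's exact protocol: read out the attention a verb token places on each region slot in a chosen layer and check whether the most-attended region overlaps the region that depicts the verb's argument. The built-in PyTorch encoder does not expose per-head weights directly, so the attention tensor below is a stand-in; in practice one would hook the attention module of the model in use.

    import torch

    def verb_region_attention(attn, verb_pos, num_text_tokens):
        """attn: (heads, seq, seq) attention weights from one Transformer layer.

        Returns the attention mass the verb token places on each image region,
        averaged over heads (regions occupy positions num_text_tokens..seq-1).
        """
        verb_row = attn[:, verb_pos, num_text_tokens:]   # (heads, num_regions)
        return verb_row.mean(dim=0)                       # (num_regions,)

    # usage with a stand-in attention tensor: 16 text tokens + 36 regions = 52 positions
    attn = torch.softmax(torch.randn(12, 52, 52), dim=-1)
    scores = verb_region_attention(attn, verb_pos=5, num_text_tokens=16)
    print(int(scores.argmax()))  # index of the region the verb attends to most

A failure pattern, for example the top-attended region rarely overlapping the annotated argument box across many verb instances, would be the falsifying observation described above.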

read the original abstract

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
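
A hedged sketch of what the two caption-based pre-training signals could look like in code, assuming they are of the masked-LM-with-image and caption-image matching kind: the masked-token loss is computed while the image regions stay visible to the encoder, and a binary head predicts whether the caption describes the image. The function and head names are assumptions for illustration, reusing the encoder sketch above; this is not the released training code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def caption_pretraining_loss(encoder, masked_token_ids, region_feats,
                                 mlm_labels, match_labels, mlm_head, match_head):
        """Combine the two visually-grounded objectives on one caption-image batch.

        mlm_labels holds the original ids at masked positions and -100 elsewhere;
        match_labels is 1 when the caption describes the image and 0 otherwise.
        """
        T = masked_token_ids.size(1)
        hidden = encoder(masked_token_ids, region_feats)       # (B, T + R, H)
        # Objective 1: masked language modeling with the image regions in context.
        mlm_logits = mlm_head(hidden[:, :T])                   # (B, T, vocab)
        mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                                   ignore_index=-100)
        # Objective 2: caption-image matching, read off the first text position.
        match_loss = F.cross_entropy(match_head(hidden[:, 0]), match_labels)
        return mlm_loss + match_loss

    # usage with the sketch encoder defined earlier
    enc = VisualBERTSketch()
    mlm_head, match_head = nn.Linear(768, 30522), nn.Linear(768, 2)
    ids = torch.randint(0, 30522, (2, 16))
    labels = torch.full((2, 16), -100, dtype=torch.long)
    labels[:, 3] = ids[:, 3]   # in real pre-training, position 3 would carry a [MASK] id
    loss = caption_pretraining_loss(enc, ids, torch.randn(2, 36, 2048),
                                    labels, torch.tensor([1, 0]), mlm_head, match_head)

Only the text positions receive the masked-token loss here, matching the description of the objectives as language modeling that incorporates visual information rather than region reconstruction.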

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VisualBERT, a stack of Transformer layers that uses self-attention to implicitly align text tokens with image regions. It is pre-trained on image caption data using two visually-grounded language modeling objectives and evaluated on VQA, VCR, NLVR2, and Flickr30K, where it matches or exceeds prior state-of-the-art models while remaining simpler. Additional analysis claims that the model grounds language elements to image regions without explicit supervision and tracks syntactic relationships such as verb-argument associations.

Significance. If the central claims hold after addressing the controls below, the work supplies a clean, reproducible baseline that isolates the contribution of Transformer self-attention to cross-modal alignment. The grounding analysis, if validated, would be a useful empirical observation for the community studying how multimodal pre-training induces implicit correspondences.

major comments (2)
  1. §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and visually-grounded masked language modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.
  2. §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.
minor comments (2)
  1. Figure 2: The caption should explicitly state which pre-training objective was active when the attention maps were generated.
  2. §3.1: The notation for the combined text-image input sequence is introduced without a diagram; adding a small schematic would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and visually-grounded masked language modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.

    Authors: We agree that the requested ablation would help isolate the role of the visually-grounded objectives versus the self-attention mechanism on paired caption data. We will add this experiment to the revised §4.2: a variant of VisualBERT will be pre-trained using only standard text-only masked language modeling on the caption texts (while retaining the multimodal architecture for downstream use), and we will report both downstream task performance and the corresponding verb-argument attention patterns from §5.3. This will clarify whether the observed implicit alignments require the visual pre-training signals. revision: yes

  2. Referee: §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.

    Authors: We will revise §5.1 and add a new table summarizing model sizes (parameter counts) for VisualBERT and the main baselines. We will also explicitly note the pre-training corpora used for each model (VisualBERT uses COCO and Conceptual Captions; we will reference the datasets reported in the original papers for the baselines). Exact FLOPs and wall-clock training times for every prior model are not feasible to recompute without full re-implementations, but we will report VisualBERT’s training configuration, hardware, and wall-clock time, and we will emphasize that the architectural simplicity (single Transformer stack with no task-specific modules) is the primary basis for the baseline claim. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical pre-training and held-out evaluation, not self-referential derivation

full rationale

The paper defines VisualBERT as a Transformer stack with self-attention for implicit alignment and introduces two pre-training objectives on caption data; downstream results on VQA, VCR, NLVR2, and Flickr30K are measured on standard held-out splits after this pre-training. No equations or derivations are presented that reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing step rely on self-citation of an unverified uniqueness result. The grounding analysis is post-hoc attention inspection rather than a mathematical identity. The architecture and objectives are independent of the final benchmark numbers, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard Transformer architecture and BERT-style masked language modeling, extended with visual region inputs and caption-based pre-training; no new free parameters or invented entities are introduced beyond those already present in the base models.

axioms (2)
  • standard math: Standard Transformer self-attention layers can process concatenated text and visual region embeddings.
    Invoked when the paper states that a stack of Transformer layers implicitly aligns text and image regions.
  • domain assumption: Pre-training on image-caption pairs transfers to downstream vision-language tasks.
    Central to the claim that the two visually-grounded objectives produce useful representations.

pith-pipeline@v0.9.0 · 5440 in / 1380 out tokens · 38016 ms · 2026-05-15T23:56:11.486724+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  3. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  4. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    cs.RO 2022-04 accept novelty 7.0

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  5. A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

    cs.CL 2026-05 unverdicted novelty 6.0

    VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

  6. Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToMA uses persistent homology on H0-death and lightweight H1-birth edges to align multimodal manifolds, delivering stable gains on remote sensing and consistent benefits on fashion retrieval.

  7. MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

    cs.LG 2026-02 unverdicted novelty 6.0

    MultiModalPFN extends TabPFN with modality projectors, a multi-head gated MLP, and cross-attention pooler to unify tabular and non-tabular inputs, outperforming prior methods on medical and general multimodal datasets.

  8. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  9. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  10. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  11. CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    cs.CL 2020-02 unverdicted novelty 6.0

    CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

  12. Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

    cs.AI 2026-05 unverdicted novelty 5.0

    A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

  13. ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

    cs.CV 2026-04 unverdicted novelty 5.0

    ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.

  14. Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

    cs.CV 2026-04 unverdicted novelty 5.0

    Patch-level analysis of token attention patterns and semantic alignment detects LVLM hallucinations at up to 90% accuracy by identifying diffuse, non-localized grounding that global methods miss.

  15. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  16. Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

    cs.CV 2026-05 unverdicted novelty 4.0

    Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.

  17. Transformer Interpretability from Perspective of Attention and Gradient

    cs.AI 2026-05 unverdicted novelty 4.0

    A gradient-guiding technique for Transformer attention interpretation yields detailed feature maps and reveals imperceptible image class-rewriting attacks on Vision Transformers.

  18. Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision-language grounding shows high prompt sensitivity, with different wordings for the same object leading to distinct instance selections and text embeddings explaining only 34% of the disagreement.

  19. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
