pith. machine review for the scientific record.

arXiv: 1908.03557 · v1 · submitted 2019-08-09 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

VisualBERT: A Simple and Performant Baseline for Vision and Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords VisualBERT · vision and language · Transformer · self-attention · pre-training · visual grounding · VQA · image captioning

The pith

VisualBERT uses Transformer self-attention to align text tokens with image regions from caption data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisualBERT as a stack of Transformer layers that processes text tokens and image region features together in a single sequence. Self-attention within these layers creates implicit alignments between words and visual elements during both pre-training and downstream use. Two language modeling objectives that incorporate visual information allow pre-training on ordinary image caption datasets. On four standard vision-and-language benchmarks the resulting model matches or exceeds the performance of more elaborate systems while requiring far less custom engineering.

Core claim

VisualBERT is a stack of Transformer layers that takes text tokens and image region features as input and uses self-attention to create implicit alignments between them. Pre-training with visually-grounded language model objectives on image caption data allows it to achieve strong performance on downstream vision-and-language tasks such as VQA, VCR, NLVR2, and image retrieval on Flickr30K, while being simpler than prior approaches.
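
A minimal sketch of this single-stream design, assuming BERT-base dimensions (768 hidden units, 12 layers) and 2048-dimensional detector region features; the class name, hyperparameters, and projection layer are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class VisualBERTSketch(nn.Module):
        """Joint text + region encoder: one Transformer stack over a single sequence."""
        def __init__(self, vocab_size=30522, hidden=768, region_dim=2048,
                     layers=12, heads=12, max_len=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, hidden)
            self.pos_emb = nn.Embedding(max_len, hidden)
            self.seg_emb = nn.Embedding(2, hidden)            # 0 = text, 1 = visual
            self.region_proj = nn.Linear(region_dim, hidden)  # detector features -> hidden size
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, token_ids, region_feats):
            B, T = token_ids.shape
            R = region_feats.shape[1]
            pos = torch.arange(T, device=token_ids.device).unsqueeze(0)
            text = self.tok_emb(token_ids) + self.pos_emb(pos) \
                 + self.seg_emb(torch.zeros_like(token_ids))
            seg_v = torch.ones(B, R, dtype=torch.long, device=token_ids.device)
            visual = self.region_proj(region_feats) + self.seg_emb(seg_v)
            # One joint sequence: self-attention is free to attend across modalities,
            # which is where the implicit word-region alignments are claimed to arise.
            joint = torch.cat([text, visual], dim=1)          # (B, T + R, hidden)
            return self.encoder(joint)

    model = VisualBERTSketch()
    out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
    print(out.shape)  # torch.Size([2, 52, 768])

VisualBERT itself starts from pre-trained BERT weights on the text side; the sketch trains everything from scratch and omits that detail for brevity.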

What carries the argument

The stack of Transformer layers whose self-attention implicitly aligns text tokens with image regions.

If this is right

  • A single architecture and pre-training recipe suffices for multiple vision-and-language tasks without task-specific modules.
  • Grounding of language elements to image regions emerges without explicit supervision.
  • The model tracks syntactic relationships such as verb-argument associations in images.
  • Model complexity can be reduced while retaining competitive accuracy on visual reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same implicit-alignment approach could be tested on video-text or audio-text pairs to check generality across modalities.
  • Scaling the caption pre-training data further might improve robustness on tasks requiring fine spatial reasoning.
  • The observed syntactic sensitivity opens the possibility of using the model for visual parsing or scene-graph generation without additional labels.

Load-bearing premise

Self-attention inside the Transformer layers can learn useful alignments between text tokens and image regions from caption data alone without any explicit grounding supervision.

What would settle it

A controlled test in which VisualBERT fails to associate verbs with the image regions that depict their arguments would falsify the claim of implicit syntactic grounding.
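
One way such a test could be instrumented, stated as an assumption rather than the paper's exact protocol: read out the attention a verb token places on each region slot in a chosen layer and check whether the most-attended region overlaps the region that depicts the verb's argument. The built-in PyTorch encoder does not expose per-head weights directly, so the attention tensor below is a stand-in; in practice one would hook the attention module of the model in use.

    import torch

    def verb_region_attention(attn, verb_pos, num_text_tokens):
        """attn: (heads, seq, seq) attention weights from one Transformer layer.

        Returns the attention mass the verb token places on each image region,
        averaged over heads (regions occupy positions num_text_tokens..seq-1).
        """
        verb_row = attn[:, verb_pos, num_text_tokens:]   # (heads, num_regions)
        return verb_row.mean(dim=0)                       # (num_regions,)

    # usage with a stand-in attention tensor: 16 text tokens + 36 regions = 52 positions
    attn = torch.softmax(torch.randn(12, 52, 52), dim=-1)
    scores = verb_region_attention(attn, verb_pos=5, num_text_tokens=16)
    print(int(scores.argmax()))  # index of the region the verb attends to most

A failure pattern, for example the top-attended region rarely overlapping the annotated argument box across many verb instances, would be the falsifying observation described above.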

read the original abstract

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
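
A hedged sketch of what the two caption-based pre-training signals could look like in code, assuming they are of the masked-LM-with-image and caption-image matching kind: the masked-token loss is computed while the image regions stay visible to the encoder, and a binary head predicts whether the caption describes the image. The function and head names are assumptions for illustration, reusing the encoder sketch above; this is not the released training code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def caption_pretraining_loss(encoder, masked_token_ids, region_feats,
                                 mlm_labels, match_labels, mlm_head, match_head):
        """Combine the two visually-grounded objectives on one caption-image batch.

        mlm_labels holds the original ids at masked positions and -100 elsewhere;
        match_labels is 1 when the caption describes the image and 0 otherwise.
        """
        T = masked_token_ids.size(1)
        hidden = encoder(masked_token_ids, region_feats)       # (B, T + R, H)
        # Objective 1: masked language modeling with the image regions in context.
        mlm_logits = mlm_head(hidden[:, :T])                   # (B, T, vocab)
        mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                                   ignore_index=-100)
        # Objective 2: caption-image matching, read off the first text position.
        match_loss = F.cross_entropy(match_head(hidden[:, 0]), match_labels)
        return mlm_loss + match_loss

    # usage with the sketch encoder defined earlier
    enc = VisualBERTSketch()
    mlm_head, match_head = nn.Linear(768, 30522), nn.Linear(768, 2)
    ids = torch.randint(0, 30522, (2, 16))
    labels = torch.full((2, 16), -100, dtype=torch.long)
    labels[:, 3] = ids[:, 3]   # in real pre-training, position 3 would carry a [MASK] id
    loss = caption_pretraining_loss(enc, ids, torch.randn(2, 36, 2048),
                                    labels, torch.tensor([1, 0]), mlm_head, match_head)

Only the text positions receive the masked-token loss here, matching the description of the objectives as language modeling that incorporates visual information rather than region reconstruction.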

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VisualBERT, a stack of Transformer layers that uses self-attention to implicitly align text tokens with image regions. It is pre-trained on image caption data using two visually-grounded language modeling objectives and evaluated on VQA, VCR, NLVR2, and Flickr30K, where it matches or exceeds prior state-of-the-art models while remaining simpler. Additional analysis claims that the model grounds language elements to image regions without explicit supervision and tracks syntactic relationships such as verb-argument associations.

Significance. If the central claims hold after addressing the controls below, the work supplies a clean, reproducible baseline that isolates the contribution of Transformer self-attention to cross-modal alignment. The grounding analysis, if validated, would be a useful empirical observation for the community studying how multimodal pre-training induces implicit correspondences.

major comments (2)
  1. §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and visually-grounded masked language modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.
  2. §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.
minor comments (2)
  1. Figure 2: The caption should explicitly state which pre-training objective was active when the attention maps were generated.
  2. §3.1: The notation for the combined text-image input sequence is introduced without a diagram; adding a small schematic would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and visually-grounded masked language modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.

    Authors: We agree that the requested ablation would help isolate the role of the visually-grounded objectives versus the self-attention mechanism on paired caption data. We will add this experiment to the revised §4.2: a variant of VisualBERT will be pre-trained using only standard text-only masked language modeling on the caption texts (while retaining the multimodal architecture for downstream use), and we will report both downstream task performance and the corresponding verb-argument attention patterns from §5.3. This will clarify whether the observed implicit alignments require the visual pre-training signals. revision: yes

  2. Referee: §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.

    Authors: We will revise §5.1 and add a new table summarizing model sizes (parameter counts) for VisualBERT and the main baselines. We will also explicitly note the pre-training corpora used for each model (VisualBERT uses COCO and Conceptual Captions; we will reference the datasets reported in the original papers for the baselines). Exact FLOPs and wall-clock training times for every prior model are not feasible to recompute without full re-implementations, but we will report VisualBERT’s training configuration, hardware, and wall-clock time, and we will emphasize that the architectural simplicity (single Transformer stack with no task-specific modules) is the primary basis for the baseline claim. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on empirical pre-training and held-out evaluation, not self-referential derivation

full rationale

The paper defines VisualBERT as a Transformer stack with self-attention for implicit alignment and introduces two pre-training objectives on caption data; downstream results on VQA, VCR, NLVR2, and Flickr30K are measured on standard held-out splits after this pre-training. No equations or derivations are presented that reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing step rely on self-citation of an unverified uniqueness result. The grounding analysis is post-hoc attention inspection rather than a mathematical identity. The architecture and objectives are independent of the final benchmark numbers, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard Transformer architecture and BERT-style masked language modeling, extended with visual region inputs and caption-based pre-training; no new free parameters or invented entities are introduced beyond those already present in the base models.

axioms (2)
  • standard math: Standard Transformer self-attention layers can process concatenated text and visual region embeddings.
    Invoked when the paper states that a stack of Transformer layers implicitly aligns text and image regions.
  • domain assumption: Pre-training on image-caption pairs transfers to downstream vision-language tasks.
    Central to the claim that the two visually-grounded objectives produce useful representations.

pith-pipeline@v0.9.0 · 5440 in / 1380 out tokens · 38016 ms · 2026-05-15T23:56:11.486724+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

    cs.MM 2026-04 unverdicted novelty 7.0

    Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...

  3. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  4. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    cs.RO 2022-04 accept novelty 7.0

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  5. A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

    cs.CL 2026-05 unverdicted novelty 6.0

    VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

  6. Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToMA uses persistent homology on H0-death and lightweight H1-birth edges to align multimodal manifolds, delivering stable gains on remote sensing and consistent benefits on fashion retrieval.

  7. MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

    cs.LG 2026-02 unverdicted novelty 6.0

    MultiModalPFN extends TabPFN with modality projectors, a multi-head gated MLP, and cross-attention pooler to unify tabular and non-tabular inputs, outperforming prior methods on medical and general multimodal datasets.

  8. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  9. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  10. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  11. CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    cs.CL 2020-02 unverdicted novelty 6.0

    CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

  12. Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

    cs.AI 2026-05 unverdicted novelty 5.0

    A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

  13. ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

    cs.CV 2026-04 unverdicted novelty 5.0

    ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.

  14. Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

    cs.CV 2026-04 unverdicted novelty 5.0

    Patch-level analysis of token attention patterns and semantic alignment detects LVLM hallucinations at up to 90% accuracy by identifying diffuse, non-localized grounding that global methods miss.

  15. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  16. Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

    cs.CV 2026-05 unverdicted novelty 4.0

    Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.

  17. Transformer Interpretability from Perspective of Attention and Gradient

    cs.AI 2026-05 unverdicted novelty 4.0

    A gradient-guiding technique for Transformer attention interpretation yields detailed feature maps and reveals imperceptible image class-rewriting attacks on Vision Transformers.

  18. Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision-language grounding shows high prompt sensitivity, with different wordings for the same object leading to distinct instance selections and text embeddings explaining only 34% of the disagreement.

  19. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
