VisualBERT: A Simple and Performant Baseline for Vision and Language
Pith reviewed 2026-05-15 23:56 UTC · model grok-4.3
The pith
VisualBERT uses Transformer self-attention to align text tokens with image regions from caption data alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VisualBERT is a stack of Transformer layers that takes text tokens and image region features as input and uses self-attention to create implicit alignments between them. Pre-training with visually-grounded language model objectives on image caption data allows it to achieve strong performance on downstream vision-and-language tasks such as VQA, VCR, NLVR2, and image retrieval on Flickr30K, while being simpler than prior approaches.
What carries the argument
The stack of Transformer layers whose self-attention implicitly aligns text tokens with image regions.
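The single-stream design can be sketched in a few lines: text token embeddings and image region features are concatenated into one sequence, and an ordinary self-attention pass mixes information across both modalities with no cross-modal module. This is a toy illustration only, with identity Q/K/V projections and made-up 4-dimensional embeddings, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    # One scaled dot-product attention pass over the joint sequence;
    # identity Q/K/V projections keep the toy example short.
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in seq]
        w = softmax(scores)
        out.append([sum(wj * v[d] for wj, v in zip(w, seq))
                    for d in range(dim)])
    return out

# Hypothetical toy inputs: 2 text tokens + 2 detected regions, dim 4.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
regions = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
joint = text + regions         # one sequence, no task-specific fusion module
mixed = self_attention(joint)  # every position attends across both modalities
```

Because the text token `[1, 0, 0, 0]` overlaps most with the first region feature, its output ends up dominated by that region: alignment falls out of the dot products rather than any explicit grounding signal.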
If this is right
- A single architecture and pre-training recipe suffices for multiple vision-and-language tasks without task-specific modules.
- Grounding of language elements to image regions emerges without explicit supervision.
- The model tracks syntactic relationships such as verb-argument associations in images.
- Model complexity can be reduced while retaining competitive accuracy on visual reasoning benchmarks.
Where Pith is reading between the lines
- The same implicit-alignment approach could be tested on video-text or audio-text pairs to check generality across modalities.
- Scaling the caption pre-training data further might improve robustness on tasks requiring fine spatial reasoning.
- The observed syntactic sensitivity opens the possibility of using the model for visual parsing or scene-graph generation without additional labels.
Load-bearing premise
Self-attention inside the Transformer layers can learn useful alignments between text tokens and image regions from caption data alone without any explicit grounding supervision.
What would settle it
A controlled test in which VisualBERT fails to associate verbs with the image regions that depict their arguments would falsify the claim of implicit syntactic grounding.
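One hedged sketch of such a probe: treat the model's attention from each word to the detected regions as a soft alignment, and score whether the top-attended region matches an annotated argument. The attention rows and gold labels below are invented for illustration, not measured from the model:

```python
# Hypothetical attention mass from each word to 3 detected regions;
# the indices in `gold` are invented annotations, not real data.
attention_to_regions = {
    "dog":    [0.70, 0.20, 0.10],
    "chases": [0.15, 0.60, 0.25],  # verb: does its mass land on the argument?
    "ball":   [0.10, 0.75, 0.15],
}
gold = {"dog": 0, "chases": 1, "ball": 1}

def grounding_accuracy(attn, gold):
    """Fraction of words whose top-attended region is the gold region."""
    hits = sum(1 for word, row in attn.items()
               if row.index(max(row)) == gold[word])
    return hits / len(attn)

acc = grounding_accuracy(attention_to_regions, gold)
```

If verbs scored no better than a chance baseline under this kind of probe, the implicit-syntactic-grounding claim would fail the test described above.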
read the original abstract
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VisualBERT, a stack of Transformer layers that uses self-attention to implicitly align text tokens with image regions. It is pre-trained on image caption data using two visually-grounded language modeling objectives and evaluated on VQA, VCR, NLVR2, and Flickr30K, where it matches or exceeds prior state-of-the-art models while remaining simpler. Additional analysis claims that the model grounds language elements to image regions without explicit supervision and tracks syntactic relationships such as verb-argument associations.
Significance. If the central claims hold after addressing the controls below, the work supplies a clean, reproducible baseline that isolates the contribution of Transformer self-attention to cross-modal alignment. The grounding analysis, if validated, would be a useful empirical observation for the community studying how multimodal pre-training induces implicit correspondences.
major comments (2)
- [§4.2] §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and masked-region modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.
- [§5.1] §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.
minor comments (2)
- [Figure 2] Figure 2: The caption should explicitly state which pre-training objective was active when the attention maps were generated.
- [§3.1] §3.1: The notation for the combined text-image input sequence is introduced without a diagram; adding a small schematic would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [§4.2] §4.2 (Pre-training objectives): The manuscript does not report an ablation that replaces the two visually-grounded objectives with standard text-only BERT pre-training and then re-measures both downstream task performance and the verb-argument attention patterns shown in §5.3. Without this control, it remains possible that the observed alignments are induced by the image-text matching and masked-region modeling terms rather than by self-attention on caption data alone; this directly affects the central claim of implicit grounding without explicit supervision.
Authors: We agree that the requested ablation would help isolate the role of the visually-grounded objectives versus the self-attention mechanism on paired caption data. We will add this experiment to the revised §4.2: a variant of VisualBERT will be pre-trained using only standard text-only masked language modeling on the caption texts (while retaining the multimodal architecture for downstream use), and we will report both downstream task performance and the corresponding verb-argument attention patterns from §5.3. This will clarify whether the observed implicit alignments require the visual pre-training signals. revision: yes
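The logic of the requested control can be made concrete as an ablation grid: the comparison is informative only if exactly one pre-training signal differs between the full recipe and the text-only arm. The arm names and flag keys below are hypothetical, not the paper's actual configuration:

```python
# Hypothetical ablation arms for the requested control; flag names are
# illustrative, not real configuration keys from the paper.
ablation_arms = {
    "full": {
        "masked_lm": True, "image_text_match": True, "visual_input": True,
    },
    "text_only_lm": {
        "masked_lm": True, "image_text_match": False, "visual_input": True,
    },
    "no_pretrain": {
        "masked_lm": False, "image_text_match": False, "visual_input": False,
    },
}

def isolates_objective(arms):
    # The control isolates the visually-grounded objective only if exactly
    # one signal separates the full recipe from the text-only arm.
    diff = [k for k, v in arms["full"].items()
            if arms["text_only_lm"][k] != v]
    return diff == ["image_text_match"]
```

Any further difference between the two arms (data, compute, architecture) would confound the attribution of the grounding effect.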
-
Referee: [§5.1] §5.1 (Experimental setup): The comparisons to prior work on VQA, VCR, NLVR2, and Flickr30K do not state whether all models were pre-trained on identical caption corpora and with comparable compute budgets. Because the abstract emphasizes that VisualBERT is “significantly simpler,” the absence of parameter counts, FLOPs, or training-time tables makes it difficult to evaluate the fairness of the baseline claim.
Authors: We will revise §5.1 and add a new table summarizing model sizes (parameter counts) for VisualBERT and the main baselines. We will also explicitly note the pre-training corpora used for each model (VisualBERT uses COCO and Conceptual Captions; we will reference the datasets reported in the original papers for the baselines). Exact FLOPs and wall-clock training times for every prior model are not feasible to recompute without full re-implementations, but we will report VisualBERT’s training configuration, hardware, and wall-clock time, and we will emphasize that the architectural simplicity (single Transformer stack with no task-specific modules) is the primary basis for the baseline claim. revision: partial
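For the promised size comparison, a rough parameter-count estimator is enough to sanity-check the reported numbers. The sketch below uses the standard BERT-base configuration values and deliberately omits bias and LayerNorm terms; it is an approximation, not an exact accounting:

```python
def transformer_params(vocab, hidden, layers, max_pos=512, type_vocab=2):
    """Rough BERT-style encoder size: token/position/segment embeddings plus,
    per layer, 4*H^2 attention projections and 8*H^2 feed-forward weights
    (biases and LayerNorm parameters omitted)."""
    embeddings = (vocab + max_pos + type_vocab) * hidden
    per_layer = 12 * hidden * hidden
    return embeddings + layers * per_layer

# Standard BERT-base configuration; lands near the ~110M usually quoted.
bert_base = transformer_params(vocab=30522, hidden=768, layers=12)
```

Published parameter tables that deviate substantially from such an estimate would warrant a closer look at what is being counted.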
Circularity Check
No circularity: claims rest on empirical pre-training and held-out evaluation, not self-referential derivation
full rationale
The paper defines VisualBERT as a Transformer stack with self-attention for implicit alignment and introduces two pre-training objectives on caption data; downstream results on VQA, VCR, NLVR2, and Flickr30K are measured on standard held-out splits after this pre-training. No equations or derivations are presented that reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing step rely on self-citation of an unverified uniqueness result. The grounding analysis is post-hoc attention inspection rather than a mathematical identity. The architecture and objectives are independent of the final benchmark numbers, satisfying the criteria for a self-contained empirical contribution.
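Post-hoc attention inspection of the kind described above is often summarized with a simple statistic: the entropy of a word's attention distribution over regions, where low entropy indicates peaked, localized grounding and high entropy indicates diffuse attention. A minimal sketch with invented distributions:

```python
import math

def attention_entropy(row):
    """Shannon entropy (nats) of one attention distribution over regions:
    low entropy means peaked (localized) attention, high means diffuse."""
    return -sum(p * math.log(p) for p in row if p > 0)

peaked = [0.90, 0.05, 0.05]    # hypothetical well-grounded word
diffuse = [0.34, 0.33, 0.33]   # hypothetical ungrounded word
```

Since the statistic is computed after training and plays no role in the objective, it cannot feed back into the benchmark numbers it is used to interpret.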
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Standard Transformer self-attention layers can process concatenated text and visual region embeddings
- domain assumption Pre-training on image-caption pairs transfers to downstream vision-language tasks
Forward citations
Cited by 19 Pith papers
-
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
-
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pa...
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning
ToMA uses persistent homology on H0-death and lightweight H1-birth edges to align multimodal manifolds, delivering stable gains on remote sensing and consistent benefits on fashion retrieval.
-
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
MultiModalPFN extends TabPFN with modality projectors, a multi-head gated MLP, and cross-attention pooler to unify tabular and non-tabular inputs, outperforming prior methods on medical and general multimodal datasets.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
PaLM-E: An Embodied Multimodal Language Model
PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
-
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
-
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
-
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
-
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
Patch-level analysis of token attention patterns and semantic alignment detects LVLM hallucinations at up to 90% accuracy by identifying diffuse, non-localized grounding that global methods miss.
-
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
-
Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.
-
Transformer Interpretability from Perspective of Attention and Gradient
A gradient-guiding technique for Transformer attention interpretation yields detailed feature maps and reveals imperceptible image class-rewriting attacks on Vision Transformers.
-
Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection
Vision-language grounding shows high prompt sensitivity, with different wordings for the same object leading to distinct instance selections and text embeddings explaining only 34% of the disagreement.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
Reference graph
Works this paper leans on
-
[1]
Bottom-up and top-down attention for image captioning and visual question answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018
work page 2018
-
[2]
VQA: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015
work page 2015
-
[3]
MUREL: Multimodal relational reasoning for visual question answering
Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. MUREL: Multimodal relational reasoning for visual question answering. In CVPR, 2019
work page 2019
-
[5]
What does BERT look at? An analysis of BERT's attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT's attention. BlackboxNLP, 2019
work page 2019
-
[6]
Stanford typed dependencies manual
Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual. Technical report, 2008
work page 2008
-
[7]
BERT: pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019
work page 2019
-
[8]
Deep biaffine attention for neural dependency parsing
Timothy Dozat and Christopher D Manning. Deep biaffine attention for neural dependency parsing. ICLR, 2017
work page 2017
-
[9]
AllenNLP: A deep semantic natural language processing platform
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 2018
work page 2018
-
[11]
Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017
work page 2017
-
[12]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016
work page 2016
-
[13]
Pythia v0.1: the Winning Entry to the VQA Challenge 2018
Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956, 2018
work page arXiv 2018
-
[14]
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019
-
[15]
Image retrieval using scene graphs
Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015
work page 2015
-
[16]
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015
work page 2015
-
[17]
ReferItGame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014
work page 2014
-
[18]
Bilinear attention networks
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In NeurIPS, 2018
work page 2018
-
[19]
Adam: A method for stochastic optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015
work page 2015
-
[20]
Visual Genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32-73, 2017
work page 2017
-
[21]
Relation-aware graph attention network for visual question answering
Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. Relation-aware graph attention network for visual question answering. ArXiv, abs/1903.12314, 2019
-
[22]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014
work page 2014
-
[23]
Learning conditioned graph structures for interpretable visual question answering
Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In NeurIPS, 2018
work page 2018
-
[24]
Deep contextualized word representations
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018
work page 2018
-
[25]
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015
work page 2015
-
[26]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018
work page 2018
-
[27]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019
work page 2019
-
[28]
Faster R-CNN: Towards real-time object detection with region proposal networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015
work page 2015
-
[29]
Imagenet large scale visual recognition challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015
work page 2015
-
[30]
A simple neural network module for relational reasoning
Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NeurIPS, 2017
work page 2017
-
[31]
Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018
work page 2018
-
[32]
Towards VQA models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019
work page 2019
-
[33]
A corpus for reasoning about natural language grounded in photographs
Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. ACL, 2019
work page 2019
-
[34]
VideoBERT: A joint model for video and language representation learning
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019
-
[35]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017
work page 2017
-
[36]
Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ACL, 2019
work page 2019
-
[38]
Stacked attention networks for image question answering
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016
work page 2016
-
[40]
Deep modular co-attention networks for visual question answering
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In CVPR, 2019 b
work page 2019
-
[41]
From recognition to cognition: Visual commonsense reasoning
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019
work page 2019
-
[43]
What do you learn from context? Probing for sentence structure in contextualized word representations. ICLR.
-
[45]
Learning conditioned graph structures for interpretable visual question answering. NeurIPS.
-
[46]
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Multimodal Transformer with Multi-View Visual Representation for Image Captioning. arXiv preprint arXiv:1905.07841.
work page arXiv
-
[47]
Deep visual-semantic alignments for generating image descriptions. CVPR.
-
[48]
Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. MUREL: Multimodal relational reasoning for visual question answering. In CVPR, 2019
-
[49]
A simple neural network module for relational reasoning. NeurIPS.
-
[50]
Stacked attention networks for image question answering. CVPR.
-
[51]
Deep modular co-attention networks for visual question answering. CVPR.
-
[52]
Baby talk: Understanding and generating image descriptions. CVPR.
-
[53]
Unsupervised textual grounding: Linking words to image concepts. CVPR.
-
[54]
Knowledge aided consistency for weakly supervised phrase grounding. CVPR.
-
[55]
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. NLP-OSS, 2018
-
[56]
Deep biaffine attention for neural dependency parsing. ICLR.
-
[57]
Query-adaptive R-CNN for open-vocabulary object detection and retrieval. arXiv preprint arXiv:1711.09509.
work page arXiv
-
[58]
Interpretable and globally optimal prediction for textual grounding using image concepts. NeurIPS.
-
[59]
Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, and Piotr Dollár. Detectron, 2018
-
[61]
A survey on still image based human action recognition. Pattern Recognition.
-
[62]
Situation recognition: Visual semantic role labeling for image understanding. CVPR.
-
[64]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015
-
[65]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017
-
[68]
Neural motifs: Scene graph parsing with global context. CVPR.
-
[69]
Visual semantic role labeling. arXiv preprint arXiv:1505.04474.
work page arXiv
-
[70]
Distributed representations of words and phrases and their compositionality. NeurIPS.
-
[71]
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. NAACL-HLT.
-
[72]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
work page arXiv
-
[73]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019
-
[74]
Show, attend and tell: Neural image caption generation with visual attention. ICML.
-
[75]
Generation and comprehension of unambiguous object descriptions. CVPR.
-
[77]
Phrase localization and visual relationship detection with comprehensive image-language cues. CVPR.
-
[78]
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014
-
[79]
Ask me anything: Free-form visual question answering based on knowledge from external sources. CVPR.
-
[80]
VizWiz grand challenge: Answering visual questions from blind people. CVPR.
-
[81]
Visual Madlibs: Fill in the blank Image Generation and Question Answering
Visual Madlibs: Fill in the blank image generation and question answering. arXiv preprint arXiv:1506.00278.
work page arXiv
-
[83]
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
work page arXiv
discussion (0)