Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

David A. Shamma; Fei-Fei Li; Joshua Kravitz; Justin Johnson; Kenji Hata; Li-Jia Li; Michael S. Bernstein; Oliver Groth; Ranjay Krishna; Stephanie Chen

arXiv: 1602.07332 · v1 · pith:R5OGK36Tnew · submitted 2016-02-23 · cs.CV · cs.AI

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li

classification cs.CV cs.AI

keywords image, objects, relationships, tasks, attributes, annotations, carriage, cognitive

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
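
The relationship notation in the abstract, riding(man, carriage) and pulling(horse, carriage), can be read as labeled edges between objects. The following is a minimal sketch of that idea, not the dataset's actual schema or API: the class names `Relationship` and `SceneGraph` and the helper methods `objects_of` and `subjects_of` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Relationship:
    """One directed, labeled edge, e.g. riding(man, carriage). Hypothetical type."""
    predicate: str
    subject: str
    obj: str


@dataclass
class SceneGraph:
    """Toy container for one image's annotations (illustrative, not the real schema)."""
    objects: Set[str] = field(default_factory=set)
    # Attributes per object; left empty in this example.
    attributes: Dict[str, List[str]] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_relationship(self, predicate: str, subject: str, obj: str) -> None:
        # Register both endpoints as objects, then store the edge.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(predicate, subject, obj))

    def objects_of(self, predicate: str, subject: str) -> List[str]:
        # All objects that `subject` relates to via `predicate`.
        return [r.obj for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate: str, obj: str) -> List[str]:
        # All subjects that relate to `obj` via `predicate`.
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.obj == obj]


# The two relationships named in the abstract.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

# "What vehicle is the person riding?" becomes two edge lookups.
vehicles = graph.objects_of("riding", "man")
pullers = graph.subjects_of("pulling", vehicles[0])
print(vehicles, pullers)  # ['carriage'] ['horse']
```

Combining the two lookups gives the pieces needed for the answer quoted in the abstract: the man rides a carriage, and the carriage is pulled by a horse.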

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

Deep Modular Co-Attention Networks for Visual Question Answering
cs.CV 2019-06 conditional novelty 7.0

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
cs.CV 2026-04 unverdicted novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
cs.CV 2025-09 unverdicted novelty 6.0

LaV-CoT introduces a multi-stage visual CoT pipeline and GRPO training with language-consistency rewards, delivering up to 9.5% accuracy gains on multilingual VQA benchmarks over similar-sized open models.

Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

Florence: A New Foundation Model for Computer Vision
cs.CV 2021-11 unverdicted novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

PIQA: Reasoning about Physical Commonsense in Natural Language
cs.CL 2019-11 accept novelty 6.0

PIQA is a new benchmark showing that current AI models achieve 77% on physical commonsense questions versus humans at 95%.

UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
cs.CV 2026-04 unverdicted novelty 5.0

UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...

Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI
cs.AI 2026-02 unverdicted novelty 5.0

VIRF combines a deterministic logic tutor with LLM planners to achieve zero hazardous action rates in home safety tasks through iterative plan repairs.

VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
cs.RO 2024-11 unverdicted novelty 5.0

VeriGraph integrates VLMs with scene-graph verification to raise robot task success rates by 30-58% over baselines in manipulation scenarios.

GIT: A Generative Image-to-text Transformer for Vision and Language
cs.CV 2022-05 unverdicted novelty 5.0

GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification
cs.CV 2024-02 unverdicted novelty 4.0

Hybrid knowledge graph embeddings fused with vision transformer features outperform standard techniques on abstract concept classification by integrating situated perceptual knowledge from a new cultural image resource.

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
cs.CV 2026-04 unverdicted novelty 2.0

The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.