Dense Captioning with Joint Inference and Visual Context

Jianchao Yang; Kevin Tang; Li-Jia Li; Linjie Yang

arxiv: 1611.06949 · v2 · pith:AMT6THCDnew · submitted 2016-11-21 · 💻 cs.CV

Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li

classification 💻 cs.CV

keywords dense · visual · captioning · model · architecture · challenges · concept · concepts

Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) in images and label each with a short descriptive phrase. We identify two key challenges of dense captioning that need to be properly addressed when tackling the problem. First, the dense visual concept annotations in each image are associated with highly overlapping target regions, which makes accurate localization of each visual concept challenging. Second, the large number of visual concepts makes it hard to recognize each of them by appearance alone. We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. We design our model architecture in a methodical manner and thoroughly evaluate the variations in architecture. Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome for dense captioning, with a relative gain of 73% over the previous best algorithm. Qualitative experiments also reveal the semantic capabilities of our model in dense captioning.
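To make the first challenge concrete, the sketch below measures how strongly two target regions overlap. It is an illustration only: the abstract does not say how overlap is measured, so intersection over union (IoU) is an assumption here, and the function name `iou` and the two example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical regions: a whole person and that person's upper body.
# Both are valid caption targets, yet the boxes are nearly nested.
whole_person = (10, 10, 110, 210)
upper_body = (10, 10, 110, 160)

overlap = iou(whole_person, upper_body)
print(f"IoU = {overlap:.2f}")
```

Two regions this close in position carry different captions, which is why localizing each one from appearance alone is difficult.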

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation
cs.CL · 2019-07 · unverdicted · novelty 7.0

The paper releases the first multimodal English-Hindi machine translation dataset of 31,525 segments with images and a challenge test set of 1,400 segments selected via embedding similarity for image-resolvable ambiguities.