arxiv: 1809.08697 · v1 · pith:TAQ75G5Anew · submitted 2018-09-23 · 💻 cs.CL · cs.CV

Textually Enriched Neural Module Networks for Visual Question Answering

keywords question, answering, visual, image, network, been, information, neural

Problems at the intersection of language and vision, like visual question answering, have recently been gaining a lot of attention in the field of multi-modal machine learning as computer vision research moves beyond traditional recognition tasks. There has been recent success in visual question answering using deep neural network models which use the linguistic structure of the questions to dynamically instantiate network layouts. In the process of converting the question to a network layout, the question is simplified, which results in loss of information in the model. In this paper, we enrich the image information with textual data using image captions and external knowledge bases to generate more coherent answers. We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    cs.CV 2023-08 unverdicted novelty 6.0

    DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.