VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
Forward citations
Cited by 3 Pith papers
- Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
  XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.