VQA: Visual Question Answering

Aishwarya Agrawal; C. Lawrence Zitnick; Devi Parikh; Dhruv Batra; Jiasen Lu; Margaret Mitchell; Stanislaw Antol

arxiv: 1505.00468 · v7 · pith:GKRFJKR2new · submitted 2015-05-03 · 💻 cs.CL · cs.CV

VQA: Visual Question Answering

Aishwarya Agrawal , Jiasen Lu , Stanislaw Antol , Margaret Mitchell , C. Lawrence Zitnick , Dhruv Batra , Devi Parikh This is my paper

classification 💻 cs.CL cs.CV

keywords imageanswersopen-endedquestionquestionsvisualansweringcloudcv

0 comments

read the original abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
VCBench: Benchmarking LLMs in Venture Capital
cs.AI 2025-09 unverdicted novelty 7.0

VCBench is a new privacy-preserving benchmark showing LLMs like DeepSeek-V3 achieve over six times the market baseline precision in predicting founder success.
V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
cs.CV 2024-08 unverdicted novelty 7.0

V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen cl...
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV 2026-05 conditional novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Decoder-based VLMs hallucinate because visual embeddings are over-aligned to a text manifold; projecting out the top principal components of a universal linguistic subspace reduces this bias and improves benchmark per...