Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

· 2018 · cs.CV · arXiv 1808.00265

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

A key aspect of interpretable VQA models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific to visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with manually annotated groundings, while achieving state-of-the-art VQA accuracy.

representative citing papers

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.

citing papers explorer

Showing 1 of 1 citing paper.

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · none · ref 66 · internal anchor

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
