pith. sign in

arxiv: 1805.00545 · v1 · pith:LFYOHZACnew · submitted 2018-05-01 · 💻 cs.CV

Weakly Supervised Attention Learning for Textual Phrases Grounding

classification 💻 cs.CV
keywords learningsupervisedvisualapplicationsattentionattentiveground-truthgrounding
0
0 comments X
read the original abstract

Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications such as image-text inference or text-driven multimedia interaction. Most of the current existing methods adopt the supervised learning mechanism which requires ground-truth at pixel level during training. However, fine-grained level ground-truth annotation is quite time-consuming and severely narrows the scope for more general applications. In this extended abstract, we explore methods to localize flexibly image regions from the top-down signal (in a form of one-hot label or natural languages) with a weakly supervised attention learning mechanism. In our model, two types of modules are utilized: a backbone module for visual feature capturing, and an attentive module generating maps based on regularized bilinear pooling. We construct the model in an end-to-end fashion which is trained by encouraging the spatial attentive map to shift and focus on the region that consists of the best matched visual features with the top-down signal. We demonstrate the preliminary yet promising results on a testbed that is synthesized with multi-label MNIST data.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.