Multitask Text-to-Visual Embedding with Titles and Clickthrough Data

Baldo Faieta; Pranav Aggarwal; Saeid Motiian; Zhe Lin

arxiv: 1905.13339 · v1 · pith:2F76QRCNnew · submitted 2019-05-30 · 💻 cs.CV · cs.IR



keywords: embedding, image, data, encoder, learning, method, propose, text-to-visual


Text-visual (also called semantic-visual) embedding is a central problem in vision-language research. It typically involves mapping an image and a text description to a common feature space through a CNN image encoder and an RNN language encoder. In this paper, we propose a new method for learning text-visual embeddings using both image titles and click-through data from an image search engine. We also propose a new triplet loss function that models positive awareness of the embedding, and introduce a novel mini-batch-based hard negative sampling approach for better data efficiency during learning. Experimental results show that our proposed method outperforms existing methods and is also effective for real-world text-to-visual retrieval.
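
For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of a generic triplet loss with in-batch hard negative selection. It is not the loss proposed in the paper: the abstract does not specify the positive-aware formulation or the sampling procedure, so this shows only the standard margin-based variant. The function names, the cosine similarity measure, and the margin value of 0.2 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_triplet_loss(text_emb, image_emb, margin=0.2):
    """Mean margin-based triplet loss using the hardest in-batch negative.

    Assumption: text_emb[i] and image_emb[i] form a matching pair, and
    every other image in the batch is a negative for text i. The batch
    must contain at least two pairs.
    """
    n = len(text_emb)
    total = 0.0
    for i in range(n):
        pos = cosine(text_emb[i], image_emb[i])
        # Hardest negative: the non-matching image most similar to text i.
        hardest = max(
            cosine(text_emb[i], image_emb[j]) for j in range(n) if j != i
        )
        # Penalize only when the negative is within `margin` of the positive.
        total += max(0.0, margin + hardest - pos)
    return total / n


# Toy batch of two text/image pairs in a 2-D embedding space.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = hard_negative_triplet_loss(texts, aligned_images)
swapped_loss = hard_negative_triplet_loss(texts, swapped_images)
```

In this toy batch the loss is zero when each text embedding coincides with its paired image embedding, and it grows when the pairs are swapped, which is the behavior a retrieval-oriented embedding objective is meant to encourage.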
