Deeply Supervised Multimodal Attentional Translation Embeddings for Visual Relationship Detection

Athanasia Zlatintsi; Nikolaos Gkanatsios; Petros Koutras; Petros Maragos; Vassilis Pitsikalis

arxiv: 1902.05829 · v1 · pith:J2MDVBYFnew · submitted 2019-02-15 · 💻 cs.CV


Nikolaos Gkanatsios, Vassilis Pitsikalis, Petros Koutras, Athanasia Zlatintsi, Petros Maragos

classification 💻 cs.CV

keywords: attentional, multimodal, visual, branch, deeply, embeddings, supervised, translation


Detecting visual relationships, i.e., <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task that has previously been approached via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, several of which we also re-implement and fine-tune. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, and we justify our claims both quantitatively and qualitatively.
