Understanding, Categorizing and Predicting Semantic Image-Text Relations

Avishek Anand; Christian Otto; Matthias Springstein; Ralph Ewerth

arxiv: 1906.08595 · v1 · pith:ANBGJTW2new · submitted 2019-06-20 · 💻 cs.MM · cs.IR

Pith reviewed 2026-05-25 19:12 UTC · model grok-4.3

keywords: semantic image-text relations · multimodal embeddings · image-text categorization · cross-modal mutual information · semantic correlation · deep learning prediction · multimodal information retrieval

The pith

Eight semantic image-text classes can be defined by three metrics and predicted by a deep learning system using multimodal embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to move past image captioning by defining how images and text interact in meaningful ways. It introduces eight distinct classes of semantic relations, such as illustration and anchorage, and shows they can be described consistently through cross-modal mutual information, semantic correlation, and the relative status of the two modalities. A deep learning model trained on automatically gathered data learns to assign new image-text pairs to these classes. This matters for applications like web search and recommender systems that need to handle the specific purpose of each image-text combination rather than treating them as simple pairs.

Core claim

The paper derives a categorization of eight semantic image-text classes and shows how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. It further presents a deep learning system that predicts these classes by utilizing multimodal embeddings, trained on data automatically collected and augmented from a variety of datasets and web resources.

What carries the argument

The categorization of eight semantic image-text classes, each defined by the combination of cross-modal mutual information, semantic correlation, and status relation between image and text.
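To make the idea of "a class defined by a combination of three metrics" concrete, here is a minimal Python sketch. It is an illustration, not the authors' code. The names `RelationMetrics`, `KNOWN_CLASSES`, and `lookup_class` are hypothetical, and encoding the status relation as the strings "0", "T", and "I" is an assumption based on how the values are written in the paper excerpts next to Figures 5 and 6 on this page. Only the two combinations quoted there are filled in; the other six classes are deliberately left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RelationMetrics:
    """One combination of the three metrics (hypothetical container)."""
    cmi: int   # cross-modal mutual information
    sc: int    # semantic correlation
    stat: str  # status relation, written "0", "T" or "I" as in the excerpts


# Only the combinations quoted in the figure excerpts on this page.
# The remaining six classes are intentionally not listed here.
KNOWN_CLASSES: Dict[RelationMetrics, str] = {
    RelationMetrics(cmi=0, sc=0, stat="0"): "Uncorrelated",
    RelationMetrics(cmi=1, sc=1, stat="0"): "Complementary",
}


def lookup_class(metrics: RelationMetrics) -> Optional[str]:
    """Return the class name for a metric combination, or None if not listed."""
    return KNOWN_CLASSES.get(metrics)


if __name__ == "__main__":
    print(lookup_class(RelationMetrics(cmi=1, sc=1, stat="0")))
```

A frozen dataclass is used so that a metric combination is hashable and can serve directly as a dictionary key, which mirrors the claim that each class corresponds to exactly one combination.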

If this is right

Multimodal web search and recommender systems can incorporate the specific semantic role of each image-text pair rather than treating all pairs uniformly.
Automatic understanding of complementary information in online news, videos, and scientific publications becomes feasible through class prediction.
The collected training data supports further experiments on predicting or retrieving image-text relations.
Image captioning systems can be extended to also output the semantic class of the relation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The three-metric characterization could be applied to rank or filter image-text results in existing search engines.
The classes might reveal patterns in how different domains, such as education versus news, prefer certain relations.
If the prediction model works, it could serve as a preprocessing step for other multimodal tasks like visual question answering.

Load-bearing premise

The automatically collected and augmented data from various datasets and web resources provides a sufficiently large and accurate training set that enables the deep learning system to generalize to real-world image-text pairs without significant labeling noise or selection bias.

What would settle it

The claim would be undermined if the trained deep learning system, evaluated on a new test set of manually labeled real-world image-text pairs drawn from news, educational, and scientific sources, achieved prediction accuracy for the eight classes near the random baseline.
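As a rough sketch of what "near random baseline" means for eight classes: uniform random guessing is expected to be correct one time in eight, i.e. 12.5%. The helper names below and the `margin` threshold are hypothetical choices for illustration, not values from the paper. If the test set is imbalanced, a majority-class baseline can be higher than 12.5% and would be the fairer comparison.

```python
from typing import Sequence

NUM_CLASSES = 8  # the eight image-text classes


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of pairs whose predicted class equals the manual label."""
    if len(predicted) != len(gold) or len(gold) == 0:
        raise ValueError("predicted and gold must be non-empty and equally long")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def chance_level(num_classes: int = NUM_CLASSES) -> float:
    """Expected accuracy of uniform random guessing over num_classes."""
    return 1.0 / num_classes


def is_near_chance(acc: float, num_classes: int = NUM_CLASSES,
                   margin: float = 0.05) -> bool:
    """True if acc lies within `margin` of chance (margin is an assumption)."""
    return abs(acc - chance_level(num_classes)) <= margin


if __name__ == "__main__":
    print(chance_level())        # 0.125
    print(is_near_chance(0.14))  # True
```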

Figures

Figures reproduced from arXiv: 1906.08595 by Avishek Anand, Christian Otto, Matthias Springstein, Ralph Ewerth.

**Figure 1.** An example of a complex message portrayed by an image-text pair elucidating the semantic gap between the textual information and the image content. (Source: [18]) In this paper, we leverage taxonomies from visual communication research and derive a set of eight computable, semantic image-text relations for multimodal indexing and search. These image-text relations are systematically characterized by thre…

**Figure 2.** Part of Martinec and Salway’s taxonomy that dis…

**Figure 3.** Overview of the proposed image-text classes and their potential use cases. However, one drawback of taxonomies in communication science is that their level of detail makes it sometimes difficult to assign image-text pairs to a particular class, as criticized by Bateman [5]. First, we evaluate the image-text classes described in communication science literature according to their usefulness for information…

**Figure 4.** Our categorization of image-text relations.

**Figure 5.** Examples for the Uncorrelated (left), Interdependent (middle) and Complementary (right) classes. (Sources: see Section 4.1) Complementary (cmi = 1, sc = 1, stat = 0): The class Complementary comprises the classic interplay between visual and textual information, where both of them share information but also provide information that the other one does not. Neither of them is dependent on the other one and th…

**Figure 6.** Examples for the Contrasting (left), Bad Illustration (middle), and Bad Anchorage (right) classes. (Sources: see Section 4.1) …textual description relates to a visual concept in the image, there is cross-modal mutual information and CMI > 0. Cases B: cmi = 0, sc = 0, stat = T, I. The metric combination cmi = 0, sc = 0, stat = 0 describes the class Uncorrelated of image-text pairs which are neither in contextual…

**Figure 4.** The deep learning architecture is explained in section 4.2.

**Figure 7.** General structure of the deep learning system with…

**Figure 8.** Results for both classifiers.

Original abstract

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of data sets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The eight-class taxonomy grounded in three metrics is the actual new piece here, but automatic data labeling without validation details is the main soft spot.

The main takeaway is the eight-class taxonomy for semantic image-text relations, drawn from visual communication research and defined through cross-modal mutual information, semantic correlation, and status relation. They also build a predictor using multimodal embeddings. This framing moves past standard captioning by focusing on the role and purpose of the pair, such as illustration or anchorage, which could matter for retrieval and recommender work.

The systematic characterization via the three metrics is a clear step beyond prior approaches, and collecting data from multiple sources to train the model is a practical move to get scale. Experimental results are claimed to show feasibility on a demanding test set.

The soft spot is the data pipeline. Automatic collection and augmentation is used to create the training set, but the abstract gives no numbers on label accuracy, agreement with the metric definitions, or checks for systematic bias. If the labels drift from the intended classes, both the taxonomy application and the prediction results rest on an untested mapping.

This is worth a look for people in multimodal IR who want structured ways to handle image-text interplay rather than pure description tasks. A reader focused on new taxonomies or tasks would get value from the categorization itself. The central argument is coherent on its own terms even if the evidence on data quality is thin so far. Send it to peer review so referees can examine the full experiments and any label validation that exists.

Referee Report

1 major / 1 minor

Summary. The manuscript derives a categorization of eight semantic image-text classes (e.g., 'illustration' or 'anchorage') and shows how they can be characterized by three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. It further presents a deep learning system that predicts these classes from multimodal embeddings, trained on automatically collected and augmented data from multiple datasets and web resources, claiming that results on a demanding test set demonstrate feasibility for applications in multimodal information retrieval.

Significance. If the taxonomy, metric-based characterization, and predictor are shown to be reliable, the work would provide a structured framework for semantic image-text relations that goes beyond captioning, with potential to improve multimodal search and recommender systems by capturing the purpose and interplay of image-text constellations.

major comments (1)

[Data collection and augmentation] The automatic collection and augmentation of training data (described in the abstract and presumably detailed in the data section) is load-bearing for the central prediction claim, yet the manuscript provides no quantitative validation of label accuracy, inter-annotator agreement on the derived taxonomy, or analysis of selection bias relative to the three metric definitions. Without this, it is unclear whether the DL system learns the intended classes or collection artifacts.

minor comments (1)

The abstract refers to 'experimental results on a demanding test set' without specifying the evaluation metrics, baselines, or how the test set was constructed to be demanding; this should be expanded for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment point by point below.

Referee: [Data collection and augmentation] The automatic collection and augmentation of training data (described in the abstract and presumably detailed in the data section) is load-bearing for the central prediction claim, yet the manuscript provides no quantitative validation of label accuracy, inter-annotator agreement on the derived taxonomy, or analysis of selection bias relative to the three metric definitions. Without this, it is unclear whether the DL system learns the intended classes or collection artifacts.

Authors: We agree that additional validation of the automatically collected data would strengthen the central claim. The data collection heuristics were explicitly designed to instantiate the three metric definitions (e.g., selecting pairs from sources that satisfy the cross-modal mutual information and status-relation criteria for each class). Nevertheless, the manuscript does not report quantitative checks such as manual label accuracy on a held-out sample or bias analysis. We will therefore add a new subsection in the revised version that (i) describes a manual audit of a random sample of the training instances with reported accuracy figures, (ii) discusses observed selection biases relative to the metric definitions, and (iii) clarifies that the taxonomy itself is synthesized from prior visual-communication literature rather than a fresh multi-annotator study, so inter-annotator agreement statistics are not applicable in the same way. These additions will make the reliability of the training labels explicit.

Revision: yes
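A manual audit of this kind could be summarized as sketched below. This is a hypothetical illustration, not a procedure described in the paper: the function names, the fixed seed, and the choice of a Wilson score interval (with z = 1.96 for roughly 95% coverage) are all assumptions. The sketch draws a reproducible random sample of instance indices for annotators to check, then turns the number of labels judged correct into an accuracy estimate with an uncertainty range.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_audit_sample(instance_ids: Sequence[int], sample_size: int,
                      seed: int = 0) -> List[int]:
    """Pick instances for manual checking, without replacement."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(list(instance_ids), sample_size)


def wilson_interval(correct: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the proportion of correct labels."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    sample = draw_audit_sample(range(10_000), sample_size=100)
    # Suppose annotators judged 90 of the 100 sampled labels correct.
    low, high = wilson_interval(correct=90, total=100)
    print(f"label accuracy 0.90, interval {low:.3f} to {high:.3f}")
```

With 90 of 100 audited labels judged correct, the interval runs from roughly 0.83 to 0.94, which shows how wide the uncertainty remains for a small audit sample.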

Circularity Check

0 steps flagged

No circularity: taxonomy and metrics derived independently; prediction trained on external data.

The paper derives its eight-class taxonomy and three characterizing metrics (cross-modal mutual information, semantic correlation, status relation) from visual communication research, then trains a multimodal embedding model on automatically collected external datasets. No equations, self-citations, or steps reduce the claimed predictions or characterizations to fitted inputs or prior author results by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim depends on the appropriateness of the three metrics as definitional and the quality of the automatically gathered training data.

free parameters (1)

deep learning model parameters
The neural network weights are fitted during training on the collected data.

axioms (1)

Domain assumption: The three metrics sufficiently capture the semantic relations for categorization into eight classes.
This underpins the entire taxonomy and is stated as the way to characterize the classes.

pith-pipeline@v0.9.0 · 5766 in / 1196 out tokens · 31103 ms · 2026-05-25T19:12:38.590788+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473

[2] Saeid Balaneshin-kordan and Alexander Kotov. 2018. Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 28–36

[3] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)

[4] Roland Barthes. 1977. Image-Music-Text, ed. and trans. S. Heath, London: Fontana 332 (1977)

[5] John Bateman. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge

[6] John Bateman, Janina Wildfeuer, and Tuomo Hiippala. 2017. Multimodality: Foundations, Research and Analysis–A Problem-Oriented Introduction. Walter de Gruyter GmbH & Co KG

[7] Serhat S Bucak, Rong Jin, and Anil K Jain. 2014. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1354–1369

[8] Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/

[9] Szegedy et al. 2015. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition

[10] Mengdi Fan, Wenmin Wang, Peilei Dong, Liang Han, Ronggang Wang, and Ge Li. 2017. Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia. ACM Multimedia Conference (2017)

[11] Mehmet Gönen and Ethem Alpaydın. 2011. Multiple kernel learning algorithms. Journal of machine learning research 12, Jul (2011), 2211–2268

[12] Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers. 427–431. https://aclanthology.info/papers/E17-2068/e17-2068

[13] Michael Alexander Kirkwood Halliday and Christian MIM Matthiessen. 2013. Halliday’s introduction to functional grammar. Routledge

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition

[15] Christian Andreas Henning and Ralph Ewerth. 2017. Estimating the Information Gap between Textual and Visual Representations. ACM International Conference on Multimedia Retrieval (2017)

[16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger

[17] Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708

[18] Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual Storytelling. Conference of the North American Chapter of the Association for Computational Linguistics

[19] Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic Understanding of Image and Video Advertisements. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 1100–1110. https://doi.org/10.1109/CVPR.2017.123

[20] Natasha Jaques, Sara Taylor, Akane Sano, and Rosalind Picard. 2015. Multi-task, multi-kernel learning for estimating individual wellbeing. In Proc. NIPS Workshop on Multimodal Machine Learning, Montreal, Quebec, Vol. 898

[21] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. Densecap: Fully convolutional localization networks for dense captioning. IEEE Conference on Computer Vision and Pattern Recognition

[22] Kevin Joslyn, Kai Li, and Kien A Hua. 2018. Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 55–63

[23] Andrej Karpathy, Armand Joulin, and Fei Fei F Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems

[24] Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30, 1 (1970)

[25] Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1549–1557

[26] Jian Liang, Zhihang Li, Dong Cao, Ran He, and Jingdong Wang. 2016. Self-Paced Cross-Modal Subspace Matching. ACM SIGIR Conference on Research and Development in Information Retrieval

[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755

[28] Fayao Liu, Luping Zhou, Chunhua Shen, and Jianping Yin. 2014. Multiple kernel learning in the primal for multimodal Alzheimer’s disease classification. IEEE J. Biomedical and Health Informatics 18, 3 (2014), 984–990

[29] Radan Martinec and Andrew Salway. 2005. A system for image-text relations in new (and old) media. Visual Communication 4 (2005)

[30] Masoud Mazloom, Robert Rietveld, Stevan Rudinac, Marcel Worring, and Willemijn van Dolen. 2016. Multimodal Popularity Prediction of Brand-related Social Media Posts. ACM Multimedia Conference

[31] Scott McCloud. 1993. Understanding comics: The invisible art. Northampton, Mass (1993)

[32] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems

[33] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 19–27

[34] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. 2018. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (MM ’18). ACM, New York, NY, USA, 1856–1864. https://doi.org/10.1145/3240508.3240712

[35] Winfried Nöth. 1995. Handbook of semiotics. Indiana University Press

[36] My English Pages. 2017-11-23. List of antonyms and opposites. http://www.myenglishpages.com/site_php_files/vocabulary-lesson-opposites.php

[37] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 conference on empirical methods in natural language processing. 2539–2544

[38] Jinwei Qi, Yuxin Peng, and Yunkan Zhuo. 2018. Life-long Cross-media Correlation Learning. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM

[39] Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal Video Description. ACM Multimedia Conference

[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015)

[41] Rossano Schifanella, Paloma de Juan, Joel Tetreault, and Liangliang Cao. 2016. Detecting Sarcasm in Multimodal Social Platforms. ACM Multimedia Conference

[42] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black Holes and White Rabbits: Metaphor Identification with Visual Features. NAACL (2016)

[43] Arnold WM Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. 2000. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 12 (2000)

[44] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi

[45] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI

[46] Len Unsworth. 2007. Image/text relations and intersemiosis: Towards multimodal text description for multiliteracies education. In Proceedings of the 33rd International Systemic Functional Congress. 1165–1205

[47] Theo Van Leeuwen. 2005. Introducing Social Semiotics. Psychology Press

[48] Liang Xie, Peng Pan, Yansheng Lu, and Shixun Wang. 2014. A cross-modal multi-task learning framework for image annotation. In ACM Conference on Information and Knowledge Management. ACM

[49] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning

[50] Nan Xu and Wenji Mao. 2017. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis. In ACM on Conference on Information and Knowledge Management. ACM

[51] Xing Xu, Jingkuan Song, Huimin Lu, Yang Yang, Fumin Shen, and Zi Huang

[52] Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 46–54

[53] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. 2016. Hierarchical Attention Networks for Document Classification. North American Chapter of the Association for Computational Linguistics: Human Language Technologies

[54] Yi-Ren Yeh, Ting-Chu Lin, Yung-Yu Chung, and Yu-Chiang Frank Wang. 2012. A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection. IEEE Transactions on multimedia 14, 3 (2012), 563–574

[1] [1]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings . http://arxiv.org/abs/1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

Saeid Balaneshin-kordan and Alexander Kotov. 2018. Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 28–36

work page 2018

[3] [3]

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multi- modal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)

work page 2018

[4] [4]

Roland Barthes. 1977. Image-Music-Text, ed. and trans. S. Heath, London: Fontana 332 (1977)

work page 1977

[5] [5]

John Bateman. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge

work page 2014

[6] [6]

John Bateman, Janina Wildfeuer, and Tuomo Hiippala. 2017. Multimodality: Foundations, Research and Analysis–A Problem-Oriented Introduction . Walter de Gruyter GmbH & Co KG

work page 2017

[7] [7]

Serhat S Bucak, Rong Jin, and Anil K Jain. 2014. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1354–1369

work page 2014

[8] [8]

Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/

work page 2015

[9] [9]

Szegedy et al. 2015. Going deeper with convolutions.IEEE Conference on Computer Vision and Pattern Recognition

work page 2015

[10] [10]

Mengdi Fan, Wenmin Wang, Peilei Dong, Liang Han, Ronggang Wang, and Ge Li. 2017. Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia. ACM Multimedia Conference (2017)

work page 2017

[11] [11]

Mehmet Gönen and Ethem Alpaydın. 2011. Multiple kernel learning algorithms. Journal of machine learning research 12, Jul (2011), 2211–2268

work page 2011

[12] [12]

Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of Tricks for Efficient Text Classification. InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers . 427–431. https: //aclanthology.info/papers/E17-2068/e17-2068

work page 2017

[13] [13]

Michael Alexander Kirkwood Halliday and Christian MIM Matthiessen. 2013. Halliday’s introduction to functional grammar. Routledge

work page 2013

[14] [14]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition

work page 2016

[15] [15]

Christian Andreas Henning and Ralph Ewerth. 2017. Estimating the Information Gap between Textual and Visual Representations. ACM International Conference on Multimedia Retrieval (2017)

work page 2017

[16] [16]

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger

work page

[17] [17]

In Proceedings of the IEEE conference on computer vision and pattern recognition

Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition . 4700–4708

work page

[18] [18]

Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al

Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual Storytelling. Conference of the North American Chapter of the Association for Computational Linguistics

work page 2016

[19] [19]

Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic Understanding of Image and Video Advertisements. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 1100–1110. https://doi.org/10.1109/CVPR.2017.123

work page doi:10.1109/cvpr.2017.123 2017

[20] [20]

Natasha Jaques, Sara Taylor, Akane Sano, and Rosalind Picard. 2015. Multi-task, multi-kernel learning for estimating individual wellbeing. InProc. NIPS Workshop on Multimodal Machine Learning, Montreal, Quebec , Vol. 898

work page 2015

[21] [21]

Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. Densecap: Fully convolu- tional localization networks for dense captioning. IEEE Conference on Computer Vision and Pattern Recognition

work page 2016

[22] [22]

Kevin Joslyn, Kai Li, and Kien A Hua. 2018. Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval . ACM, 55–63

work page 2018

[23] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems

[24] Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30, 1 (1970)

[25] Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1549–1557

[26] Jian Liang, Zhihang Li, Dong Cao, Ran He, and Jingdong Wang. 2016. Self-Paced Cross-Modal Subspace Matching. ACM SIGIR Conference on Research and Development in Information Retrieval

[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755

[28] Fayao Liu, Luping Zhou, Chunhua Shen, and Jianping Yin. 2014. Multiple kernel learning in the primal for multimodal Alzheimer’s disease classification. IEEE J. Biomedical and Health Informatics 18, 3 (2014), 984–990

[29] Radan Martinec and Andrew Salway. 2005. A system for image-text relations in new (and old) media. Visual Communication 4 (2005)

[30] Masoud Mazloom, Robert Rietveld, Stevan Rudinac, Marcel Worring, and Willemijn van Dolen. 2016. Multimodal Popularity Prediction of Brand-related Social Media Posts. ACM Multimedia Conference

[31] Scott McCloud. 1993. Understanding comics: The invisible art. Northampton, Mass (1993)

[32] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems

[33] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 19–27

[34] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. 2018. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (MM ’18). ACM, New York, NY, USA, 1856–1864. https://doi.org/10.1145/3240508.3240712

[35] Winfried Nöth. 1995. Handbook of semiotics. Indiana University Press

[36] My English Pages. 2017-11-23. List of antonyms and opposites. http://www.myenglishpages.com/site_php_files/vocabulary-lesson-opposites.php

[37] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2539–2544

[38] Jinwei Qi, Yuxin Peng, and Yunkan Zhuo. 2018. Life-long Cross-media Correlation Learning. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM

[39] Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal Video Description. ACM Multimedia Conference

[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015)

[41] Rossano Schifanella, Paloma de Juan, Joel Tetreault, and Liangliang Cao. 2016. Detecting Sarcasm in Multimodal Social Platforms. ACM Multimedia Conference

[42] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black Holes and White Rabbits: Metaphor Identification with Visual Features. NAACL (2016)

[43] Arnold WM Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. 2000. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 12 (2000)

[44] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI

[46] Len Unsworth. 2007. Image/text relations and intersemiosis: Towards multimodal text description for multiliteracies education. In Proceedings of the 33rd International Systemic Functional Congress. 1165–1205

[47] Theo Van Leeuwen. 2005. Introducing Social Semiotics. Psychology Press

[48] Liang Xie, Peng Pan, Yansheng Lu, and Shixun Wang. 2014. A cross-modal multi-task learning framework for image annotation. In ACM Conference on Information and Knowledge Management. ACM

[49] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning

[50] Nan Xu and Wenji Mao. 2017. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis. In ACM on Conference on Information and Knowledge Management. ACM

[51] Xing Xu, Jingkuan Song, Huimin Lu, Yang Yang, Fumin Shen, and Zi Huang. 2018. Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 46–54

[53] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. 2016. Hierarchical Attention Networks for Document Classification. North American Chapter of the Association for Computational Linguistics: Human Language Technologies

[54] Yi-Ren Yeh, Ting-Chu Lin, Yung-Yu Chung, and Yu-Chiang Frank Wang. 2012. A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection. IEEE Transactions on Multimedia 14, 3 (2012), 563–574