arxiv: 1906.08595 · v1 · pith:ANBGJTW2new · submitted 2019-06-20 · 💻 cs.MM · cs.IR

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Pith reviewed 2026-05-25 19:12 UTC · model grok-4.3

classification 💻 cs.MM cs.IR
keywords semantic image-text relations · multimodal embeddings · image-text categorization · cross-modal mutual information · semantic correlation · deep learning prediction · multimodal information retrieval

The pith

Eight semantic image-text classes can be defined by three metrics and predicted by a deep learning system using multimodal embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to move past image captioning by defining how images and text interact in meaningful ways. It introduces eight distinct classes of semantic relations, such as illustration and anchorage, and shows they can be described consistently through cross-modal mutual information, semantic correlation, and the relative status of the two modalities. A deep learning model trained on automatically gathered data learns to assign new image-text pairs to these classes. This matters for applications like web search and recommender systems that need to handle the specific purpose of each image-text combination rather than treating them as simple pairs.

Core claim

The paper derives a categorization of eight semantic image-text classes and shows how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. It further presents a deep learning system that predicts these classes by utilizing multimodal embeddings, trained on data automatically collected and augmented from a variety of datasets and web resources.

What carries the argument

The categorization of eight semantic image-text classes, each defined by the combination of cross-modal mutual information, semantic correlation, and status relation between image and text.
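
As a minimal sketch of how such a three-metric characterization could be represented, the snippet below encodes a relation as a (cmi, sc, stat) triple and looks up a class name. Only the two combinations quoted in the figure captions on this page (Uncorrelated and Complementary) are included; the other six classes are deliberately left out because their metric values are not stated here. The names `Relation`, `KNOWN_CLASSES`, and `classify` are hypothetical and do not come from the paper.

```python
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    """A (cmi, sc, stat) triple; the encoding below is an assumption."""
    cmi: int   # cross-modal mutual information, as quoted in the captions
    sc: int    # semantic correlation, as quoted in the captions
    stat: str  # status relation, kept as a string ("0" in the captions)


# Only the combinations that appear in the captions on this page.
# The remaining six classes are intentionally not guessed.
KNOWN_CLASSES = {
    Relation(cmi=0, sc=0, stat="0"): "Uncorrelated",
    Relation(cmi=1, sc=1, stat="0"): "Complementary",
}


def classify(relation: Relation) -> Optional[str]:
    """Return the class name for a known metric triple, else None."""
    return KNOWN_CLASSES.get(relation)
```

Returning `None` for unknown triples keeps the sketch honest about what this page actually specifies.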

If this is right

  • Multimodal web search and recommender systems can incorporate the specific semantic role of each image-text pair rather than treating all pairs uniformly.
  • Automatic understanding of complementary information in online news, videos, and scientific publications becomes feasible through class prediction.
  • The collected training data supports further experiments on predicting or retrieving image-text relations.
  • Image captioning systems can be extended to also output the semantic class of the relation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The three-metric characterization could be applied to rank or filter image-text results in existing search engines.
  • The classes might reveal patterns in how different domains, such as education versus news, prefer certain relations.
  • If the prediction model works, it could serve as a preprocessing step for other multimodal tasks like visual question answering.

Load-bearing premise

The automatically collected and augmented data from various datasets and web resources provides a sufficiently large and accurate training set that enables the deep learning system to generalize to real-world image-text pairs without significant labeling noise or selection bias.

What would settle it

An evaluation of the trained deep learning system on a new test set of manually labeled real-world image-text pairs drawn from news, educational, and scientific sources, in which prediction accuracy for the eight classes falls near the random baseline.
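
A rough sketch of that check, assuming a uniform random guesser over the eight classes (chance accuracy 1/8) and an arbitrary tolerance: the function names and the `margin` value are illustrative assumptions, not part of the paper's evaluation protocol.

```python
from typing import Sequence

NUM_CLASSES = 8  # the paper defines eight image-text classes


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the manual labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def uniform_baseline(num_classes: int = NUM_CLASSES) -> float:
    """Expected accuracy of guessing uniformly at random."""
    return 1.0 / num_classes


def near_chance(acc: float, margin: float = 0.05,
                num_classes: int = NUM_CLASSES) -> bool:
    """True if accuracy is within `margin` above the uniform baseline.

    The margin is a hypothetical threshold chosen for illustration.
    """
    return acc <= uniform_baseline(num_classes) + margin
```

If the test set is imbalanced, a majority-class baseline would be the more demanding comparison than the uniform one used here.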

Figures

Figures reproduced from arXiv: 1906.08595 by Avishek Anand, Christian Otto, Matthias Springstein, Ralph Ewerth.

Figure 1: An example of a complex message portrayed by an image-text pair elucidating the semantic gap between the textual information and the image content. (Source: [18]) In this paper, we leverage taxonomies from visual communication research and derive a set of eight computable, semantic image-text relations for multimodal indexing and search. These image-text relations are systematically characterized by thre…
Figure 2: Part of Martinec and Salway’s taxonomy that dis…
Figure 3: Overview of the proposed image-text classes and their potential use cases. However, one drawback of taxonomies in communication science is that their level of detail makes it sometimes difficult to assign image-text pairs to a particular class, as criticized by Bateman [5]. First, we evaluate the image-text classes described in communication science literature according to their usefulness for information…
Figure 4: Our categorization of image-text relations. Dis…
Figure 5: Examples for the Uncorrelated (left), Interdependent (middle) and Complementary (right) classes. (Sources: see Section 4.1) Complementary (cmi = 1, sc = 1, stat = 0) The class Complementary comprises the classic interplay between visual and textual information, where both of them share information but also provide information that the other one does not. Neither of them is dependent on the other one and th…
Figure 6: Examples for the Contrasting (left), Bad Illustration (middle), and Bad Anchorage (right) classes. (Sources: see Section 4.1) textual description relates to a visual concept in the image, there is cross-modal mutual information and CMI > 0. Cases B: cmi = 0, sc = 0, stat = T, I The metric combination cmi = 0, sc = 0, stat = 0 describes the class Uncorrelated of image-text pairs which are neither in contextual…
Figure 4: The deep learning architecture is explained in section 4.2.
Figure 7: General structure of the deep learning system with…
Figure 8: Results for both classifiers. 5.2 Discussion of results As shown by…
Original abstract

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of data sets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript derives a categorization of eight semantic image-text classes (e.g., 'illustration' or 'anchorage') and shows how they can be characterized by three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. It further presents a deep learning system that predicts these classes from multimodal embeddings, trained on automatically collected and augmented data from multiple datasets and web resources, claiming that results on a demanding test set demonstrate feasibility for applications in multimodal information retrieval.

Significance. If the taxonomy, metric-based characterization, and predictor are shown to be reliable, the work would provide a structured framework for semantic image-text relations that goes beyond captioning, with potential to improve multimodal search and recommender systems by capturing the purpose and interplay of image-text constellations.

major comments (1)
  1. [Data collection and augmentation] The automatic collection and augmentation of training data (described in the abstract and presumably detailed in the data section) is load-bearing for the central prediction claim, yet the manuscript provides no quantitative validation of label accuracy, inter-annotator agreement on the derived taxonomy, or analysis of selection bias relative to the three metric definitions. Without this, it is unclear whether the DL system learns the intended classes or collection artifacts.
minor comments (1)
  1. The abstract refers to 'experimental results on a demanding test set' without specifying the evaluation metrics, baselines, or how the test set was constructed to be demanding; this should be expanded for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Data collection and augmentation] The automatic collection and augmentation of training data (described in the abstract and presumably detailed in the data section) is load-bearing for the central prediction claim, yet the manuscript provides no quantitative validation of label accuracy, inter-annotator agreement on the derived taxonomy, or analysis of selection bias relative to the three metric definitions. Without this, it is unclear whether the DL system learns the intended classes or collection artifacts.

    Authors: We agree that additional validation of the automatically collected data would strengthen the central claim. The data collection heuristics were explicitly designed to instantiate the three metric definitions (e.g., selecting pairs from sources that satisfy the cross-modal mutual information and status-relation criteria for each class). Nevertheless, the manuscript does not report quantitative checks such as manual label accuracy on a held-out sample or bias analysis. We will therefore add a new subsection in the revised version that (i) describes a manual audit of a random sample of the training instances with reported accuracy figures, (ii) discusses observed selection biases relative to the metric definitions, and (iii) clarifies that the taxonomy itself is synthesized from prior visual-communication literature rather than a fresh multi-annotator study, so inter-annotator agreement statistics are not applicable in the same way. These additions will make the reliability of the training labels explicit. revision: yes
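
A manual audit of the kind promised in the response could be sketched as follows, assuming a simple random sample of training instances and a Wilson score interval to report label accuracy with its uncertainty. The function names, the sample size, and the choice of interval are illustrative assumptions; the manuscript does not specify an audit procedure.

```python
import math
import random
from typing import Iterable, List, Tuple


def draw_audit_sample(instance_ids: Iterable[int], k: int,
                      seed: int = 0) -> List[int]:
    """Pick k distinct instances for manual inspection, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(instance_ids), k)


def wilson_interval(correct: int, n: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For example, an auditor who finds 90 correct labels in a sample of 100 would report the interval from `wilson_interval(90, 100)` alongside the point estimate of 0.90.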

Circularity Check

0 steps flagged

No circularity: taxonomy and metrics derived independently; prediction trained on external data.

full rationale

The paper derives its eight-class taxonomy and three characterizing metrics (cross-modal mutual information, semantic correlation, status relation) from visual communication research, then trains a multimodal embedding model on automatically collected external datasets. No equations, self-citations, or steps reduce the claimed predictions or characterizations to fitted inputs or prior author results by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim depends on the appropriateness of the three metrics as definitional and the quality of the automatically gathered training data.

free parameters (1)
  • deep learning model parameters
    The neural network weights are fitted during training on the collected data.
axioms (1)
  • domain assumption: The three metrics sufficiently capture the semantic relations for categorization into eight classes.
    This underpins the entire taxonomy and is stated as the way to characterize the classes.

pith-pipeline@v0.9.0 · 5766 in / 1196 out tokens · 31103 ms · 2026-05-25T19:12:38.590788+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

  1. [1]

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473

  2. [2]

    Saeid Balaneshin-kordan and Alexander Kotov. 2018. Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 28–36

  3. [3]

    Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)

  4. [4]

    Roland Barthes. 1977. Image-Music-Text, ed. and trans. S. Heath, London: Fontana 332 (1977)

  5. [5]

    John Bateman. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge

  6. [6]

    John Bateman, Janina Wildfeuer, and Tuomo Hiippala. 2017. Multimodality: Foundations, Research and Analysis–A Problem-Oriented Introduction. Walter de Gruyter GmbH & Co KG

  7. [7]

    Serhat S Bucak, Rong Jin, and Anil K Jain. 2014. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1354–1369

  8. [8]

    Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/

  9. [9]

    Szegedy et al. 2015. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition

  10. [10]

    Mengdi Fan, Wenmin Wang, Peilei Dong, Liang Han, Ronggang Wang, and Ge Li. 2017. Cross-media Retrieval by Learning Rich Semantic Embeddings of Multimedia. ACM Multimedia Conference (2017)

  11. [11]

    Mehmet Gönen and Ethem Alpaydın. 2011. Multiple kernel learning algorithms. Journal of machine learning research 12, Jul (2011), 2211–2268

  12. [12]

    Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers. 427–431. https://aclanthology.info/papers/E17-2068/e17-2068

  13. [13]

    Michael Alexander Kirkwood Halliday and Christian MIM Matthiessen. 2013. Halliday’s introduction to functional grammar. Routledge

  14. [14]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition

  15. [15]

    Christian Andreas Henning and Ralph Ewerth. 2017. Estimating the Information Gap between Textual and Visual Representations. ACM International Conference on Multimedia Retrieval (2017)

  16. [16]

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger

  17. [17]

    Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708

  18. [18]

    Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual Storytelling. Conference of the North American Chapter of the Association for Computational Linguistics

  19. [19]

    Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic Understanding of Image and Video Advertisements. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. 1100–1110. https://doi.org/10.1109/CVPR.2017.123

  20. [20]

    Natasha Jaques, Sara Taylor, Akane Sano, and Rosalind Picard. 2015. Multi-task, multi-kernel learning for estimating individual wellbeing. In Proc. NIPS Workshop on Multimodal Machine Learning, Montreal, Quebec, Vol. 898

  21. [21]

    Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. Densecap: Fully convolutional localization networks for dense captioning. IEEE Conference on Computer Vision and Pattern Recognition

  22. [22]

    Kevin Joslyn, Kai Li, and Kien A Hua. 2018. Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 55–63

  23. [23]

    Andrej Karpathy, Armand Joulin, and Fei Fei F Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems

  24. [24]

    Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement 30, 1 (1970)

  25. [25]

    Weiyu Lan, Xirong Li, and Jianfeng Dong. 2017. Fluency-guided cross-lingual image captioning. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1549–1557

  26. [26]

    Jian Liang, Zhihang Li, Dong Cao, Ran He, and Jingdong Wang. 2016. Self-Paced Cross-Modal Subspace Matching. ACM SIGIR Conference on Research and Development in Information Retrieval

  27. [27]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740–755

  28. [28]

    Fayao Liu, Luping Zhou, Chunhua Shen, and Jianping Yin. 2014. Multiple kernel learning in the primal for multimodal Alzheimer’s disease classification. IEEE J. Biomedical and Health Informatics 18, 3 (2014), 984–990

  29. [29]

    Radan Martinec and Andrew Salway. 2005. A system for image-text relations in new (and old) media. Visual Communication 4 (2005)

  30. [30]

    Masoud Mazloom, Robert Rietveld, Stevan Rudinac, Marcel Worring, and Willemijn van Dolen. 2016. Multimodal Popularity Prediction of Brand-related Social Media Posts. ACM Multimedia Conference

  31. [31]

    Scott McCloud. 1993. Understanding comics: The invisible art. Northampton, Mass (1993)

  32. [32]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems

  33. [33]

    Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 19–27

  34. [34]

    Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. 2018. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. In Proceedings of the 26th ACM International Conference on Multimedia (MM ’18). ACM, New York, NY, USA, 1856–1864. https://doi.org/10.1145/3240508.3240712

  35. [35]

    Winfried Nöth. 1995. Handbook of semiotics. Indiana University Press

  36. [36]

    My English Pages. 2017-11-23. List of antonyms and opposites. http://www.myenglishpages.com/site_php_files/vocabulary-lesson-opposites.php

  37. [37]

    Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 conference on empirical methods in natural language processing. 2539–2544

  38. [38]

    Jinwei Qi, Yuxin Peng, and Yunkan Zhuo. 2018. Life-long Cross-media Correlation Learning. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM

  39. [39]

    Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal Video Description. ACM Multimedia Conference

  40. [40]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015)

  41. [41]

    Rossano Schifanella, Paloma de Juan, Joel Tetreault, and Liangliang Cao. 2016. Detecting Sarcasm in Multimodal Social Platforms. ACM Multimedia Conference

  42. [42]

    Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black Holes and White Rabbits: Metaphor Identification with Visual Features. NAACL (2016)

  43. [43]

    Arnold WM Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. 2000. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 12 (2000)

  44. [44]

    Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi

  45. [45]

    Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. AAAI

  46. [46]

    Len Unsworth. 2007. Image/text relations and intersemiosis: Towards multimodal text description for multiliteracies education. In Proceedings of the 33rd International Systemic Functional Congress. 1165–1205

  47. [47]

    Theo Van Leeuwen. 2005. Introducing Social Semiotics. Psychology Press

  48. [48]

    Liang Xie, Peng Pan, Yansheng Lu, and Shixun Wang. 2014. A cross-modal multi-task learning framework for image annotation. In ACM Conference on Information and Knowledge Management. ACM

  49. [49]

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning

  50. [50]

    Nan Xu and Wenji Mao. 2017. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis. In ACM on Conference on Information and Knowledge Management. ACM

  51. [51]

    Xing Xu, Jingkuan Song, Huimin Lu, Yang Yang, Fumin Shen, and Zi Huang

  52. [52]

    Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. ACM, 46–54

  53. [53]

    Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. 2016. Hierarchical Attention Networks for Document Classification. North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  54. [54]

    Yi-Ren Yeh, Ting-Chu Lin, Yung-Yu Chung, and Yu-Chiang Frank Wang. 2012. A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection. IEEE Transactions on multimedia 14, 3 (2012), 563–574