arxiv: 1907.08948 · v1 · pith:NZSPOIPCnew · submitted 2019-07-21 · 💻 cs.CL

Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation

Pith reviewed 2026-05-24 18:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal machine translation · English-Hindi translation · Visual Genome · image caption translation · ambiguity resolution · dataset · WAT 2019

The pith

Hindi Visual Genome supplies the first multimodal dataset of 31,525 image-caption pairs for English-to-Hindi machine translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hindi Visual Genome by taking short English captions and their images from the existing Visual Genome collection and producing corresponding Hindi translations. Automatic translations were post-edited by humans who viewed the associated images to ensure visual grounding and to resolve ambiguities. A challenge test set of 1,400 segments was further selected by embedding-similarity search for English words whose meaning the image can clarify. The resulting resource is released for non-commercial research and serves as the official data for the WAT 2019 multimodal translation task. The work therefore supplies both training material and an evaluation benchmark that links visual context directly to Hindi output.

Core claim

We present Hindi Visual Genome, a multimodal dataset of 31,525 English segments paired with images and their manually post-edited Hindi translations, together with a 1,400-segment challenge test set chosen for image-resolvable ambiguities; this constitutes the first public resource for multimodal English-Hindi machine translation.

What carries the argument

Manual post-editing of automatic Hindi translations while viewing the associated images, plus embedding-similarity search to isolate ambiguous English words that images can disambiguate.
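
One way to picture the embedding-similarity step is the sketch below. It is not the authors' pipeline, which this page does not describe in detail; it assumes a word is a candidate for ambiguity when the captions it appears in have dissimilar averaged context embeddings. The function names (`context_vector`, `ambiguity_score`), the two-dimensional toy vectors, and the toy captions are all hypothetical; only the word "penalty" comes from the paper's Figure 2.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_vector(tokens, target, embeddings):
    """Average the embeddings of the known words around `target` in one caption."""
    vecs = [embeddings[t] for t in tokens if t != target and t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def ambiguity_score(target, captions, embeddings):
    """Return 1 minus the mean pairwise cosine similarity of the target's contexts.

    Higher values mean the word occurs in more dissimilar contexts, which this
    sketch treats as a hint (not proof) of ambiguity.
    """
    contexts = []
    for caption in captions:
        tokens = caption.lower().split()
        if target in tokens:
            vec = context_vector(tokens, target, embeddings)
            if vec is not None:
                contexts.append(vec)
    pairs = list(combinations(contexts, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_sim


# Hypothetical toy embeddings: axis 0 is roughly "sports", axis 1 roughly "traffic".
TOY_EMBEDDINGS = {
    "football": [1.0, 0.0], "kick": [0.9, 0.1], "player": [1.0, 0.1],
    "fine": [0.0, 1.0], "parking": [0.1, 0.9], "sign": [0.0, 1.0],
    "car": [0.6, 0.8], "truck": [0.6, 0.8],
}
# Hypothetical captions, not taken from the dataset.
TOY_CAPTIONS = [
    "football player takes penalty kick",
    "parking penalty fine sign",
    "red car",
    "red truck",
]

if __name__ == "__main__":
    for word in ("penalty", "red"):
        print(word, round(ambiguity_score(word, TOY_CAPTIONS, TOY_EMBEDDINGS), 3))
```

In this toy setup "penalty" scores far higher than "red" because its two captions point in different directions of the embedding space. A high score only nominates a word; per the abstract, a human still decided whether the image resolves the ambiguity.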

If this is right

  • Multimodal MT systems can now be trained and evaluated on English-Hindi pairs that use image context to choose among possible Hindi renderings.
  • The same image-informed post-editing method can be applied to produce grounded translations for additional language pairs.
  • The released data directly supports the official WAT 2019 multimodal translation shared task.
  • Hindi image-labeling and captioning tools can be built from the same paired segments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset may expose whether current text-only MT systems systematically fail on visually ambiguous English input when producing Hindi.
  • Comparable resources for other Indic languages could be generated from the same Visual Genome English base with modest additional effort.
  • Performance gains on the challenge test set versus the standard test set would quantify the practical benefit of visual context for this language pair.

Load-bearing premise

Human post-editors who see the images produce Hindi translations that are both accurate and visually grounded, and the embedding search reliably finds ambiguities that the pictures can resolve.

What would settle it

A side-by-side human evaluation in which the image-guided Hindi translations show no measurable improvement in accuracy or appropriateness over purely text-based translations would falsify the central value of the dataset construction process.
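
As a rough illustration of what "no measurable improvement" could mean in such a side-by-side evaluation, the sketch below applies an exact two-sided sign test to paired preference counts. The choice of test and the function name `sign_test_p_value` are assumptions made here, not something the paper proposes; ties between the two translations are assumed to have been dropped before counting.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test on paired preferences.

    wins_a: segments where judges preferred the image-guided translation.
    wins_b: segments where judges preferred the text-only translation.
    Under the null hypothesis each preference is a fair coin flip.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(sign_test_p_value(10, 0))
    print(sign_test_p_value(5, 5))
```

A large p-value would mean the judges' preferences are compatible with chance, which is the outcome that would undercut the dataset's construction process.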

Figures

Figures reproduced from arXiv: 1907.08948 by Ondřej Bojar, Satya Ranjan Dash, Shantipriya Parida.

Figure 1: Overall pipeline for ambiguous word finding from input corpus.
Figure 2: An illustration of two meanings of the word “penalty” exemplified with two images.

Original abstract

Visual Genome is a dataset connecting structured image information with English language. We present “Hindi Visual Genome”, a multimodal dataset consisting of text and images suitable for English-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing which took the associated images into account. We prepared a set of 31525 segments, accompanied by a challenge test set of 1400 segments. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Our dataset is the first for multimodal English-Hindi machine translation, freely available for non-commercial research purposes. Our Hindi version of Visual Genome also allows to create Hindi image labelers or other practical tools. Hindi Visual Genome also serves in Workshop on Asian Translation (WAT) 2019 Multi-Modal Translation Task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Hindi Visual Genome dataset, a collection of 31,525 English-Hindi image-caption pairs for multimodal machine translation, created by selecting segments from Visual Genome, automatically translating to Hindi, and performing manual post-editing informed by the images. It also includes a 1,400-segment challenge test set constructed by identifying ambiguous English words via embedding similarity and selecting those where the image resolves the ambiguity. The dataset is claimed to be the first for English-to-Hindi multimodal MT and is used in the WAT 2019 Multi-Modal Translation Task.

Significance. If the image-guided post-editing produces accurate translations and the test set items are correctly identified as benefiting from visual context, this dataset would fill an important gap in multimodal resources for Hindi, enabling research in visually grounded translation for a major language. The public availability and use in a shared task add to its potential utility for the community.

major comments (1)
  1. [Abstract] The manuscript describes the dataset creation process (automatic translation followed by image-aware manual post-editing, plus embedding-similarity search for the test set) but provides no quantitative measures such as translation quality metrics (e.g., BLEU or TER), inter-annotator agreement, count of image-driven edits, or empirical verification that the selected test items are resolved by images; this is load-bearing for the central claim that the 31,525 + 1,400 segments form a reliable multimodal resource.
minor comments (1)
  1. [Abstract] The description of the automatic translation step does not name the MT system or provide any details on the post-editing guidelines or annotator qualifications.
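
To make the requested quantities concrete, here is a minimal sketch of how the amount of post-editing could be measured if the raw MT output had been kept alongside the post-edited Hindi. It computes a word-level Levenshtein distance normalised by the post-edited length, which resembles TER but omits TER's block-shift operation. The function names and the placeholder tokens are hypothetical, and nothing on this page confirms that the raw MT output is available.

```python
def token_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def post_edit_rate(mt_segments, edited_segments):
    """Return (edits per post-edited token, number of segments that were changed)."""
    total_edits = 0
    total_ref_tokens = 0
    changed = 0
    for mt, edited in zip(mt_segments, edited_segments):
        hyp, ref = mt.split(), edited.split()
        distance = token_edit_distance(hyp, ref)
        total_edits += distance
        total_ref_tokens += len(ref)
        changed += distance > 0
    rate = total_edits / total_ref_tokens if total_ref_tokens else 0.0
    return rate, changed


if __name__ == "__main__":
    # Placeholder tokens stand in for Hindi words; these are not real data.
    raw_mt = ["w1 w2 w3", "w4 w5"]
    post_edited = ["w1 w9 w3", "w4 w5"]
    print(post_edit_rate(raw_mt, post_edited))
```

A count like this says how much was edited, not why; separating image-driven edits from ordinary fluency fixes would still need manual annotation.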

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. We address the major comment point-by-point below and outline planned revisions to strengthen the presentation of the dataset's reliability.

  1. Referee: [Abstract] The manuscript describes the dataset creation process (automatic translation followed by image-aware manual post-editing, plus embedding-similarity search for the test set) but provides no quantitative measures such as translation quality metrics (e.g., BLEU or TER), inter-annotator agreement, count of image-driven edits, or empirical verification that the selected test items are resolved by images; this is load-bearing for the central claim that the 31,525 + 1,400 segments form a reliable multimodal resource.

    Authors: We agree that the original manuscript lacks explicit quantitative validation of the post-editing quality and test-set construction. The dataset release paper focuses on the curation pipeline and public release for the WAT 2019 task rather than model evaluation; however, we will revise the manuscript to include: (1) BLEU scores computed against a small human-verified subset of the post-edited Hindi translations, (2) a count and categorization of image-driven edits performed during post-editing (e.g., sense disambiguation cases), and (3) additional qualitative examples plus a small-scale human evaluation confirming that the 1,400 challenge items are resolved by visual context. Inter-annotator agreement is not available because post-editing was performed by a single expert translator with image access; we will explicitly note this limitation and its implications. These additions will be placed in a new 'Dataset Validation' subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: pure dataset construction with no derivation chain

The paper describes direct construction of the Hindi Visual Genome dataset by selecting English segments and images from Visual Genome, performing automatic translation followed by image-aware manual post-editing, and building a challenge test set via embedding similarity search plus manual selection of image-resolvable ambiguities. No equations, predictions, fitted parameters, or load-bearing self-citations exist; the work is a self-contained data release effort whose claims rest on the described curation process itself rather than any internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on domain assumptions about translation quality and visual disambiguation without independent validation metrics in the provided abstract; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: Automatic translation followed by manual post-editing that takes images into account yields accurate, visually grounded Hindi captions.
    Described as the core creation method in the abstract.
  • Domain assumption: Embedding similarity can locate ambiguous English words for which the associated image resolves the correct sense.
    Basis for constructing the challenge test set.
