Multi30K: Multilingual English-German Image Descriptions

Desmond Elliott; Khalil Sima'an; Lucia Specia; Stella Frank

arXiv: 1605.00459 · v1 · pith:UJ3HINPCnew · submitted 2016-05-02 · cs.CL · cs.CV

keywords: descriptions, image, dataset, description, english, multilingual, data, multi30k

Abstract

We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions. We outline how the data can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence
math.OC · 2025-05 · conditional · novelty 7.0

GT-NSGDm achieves the optimal non-asymptotic convergence rate O(1/T^{(p-1)/(3p-2)}) for decentralized nonconvex stochastic optimization under zero-mean heavy-tailed noise with p-th moment.

Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation
cs.CL · 2019-07 · unverdicted · novelty 7.0

The paper releases the first multimodal English-Hindi machine translation dataset of 31,525 segments with images and a challenge test set of 1,400 segments selected via embedding similarity for image-resolvable ambiguities.

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
cs.CL · 2026-05 · unverdicted · novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.

Video-guided Machine Translation with Global Video Context
cs.CV · 2026-04 · unverdicted · novelty 4.0

A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.