Multilingual Training and Evaluation Resources for Vision-Language Models
Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3
The pith
Training vision-language models on multilingual multimodal examples improves non-English benchmark performance with positive transfer to English.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training VLMs on the regenerated multilingual multimodal examples in Multi-PixMo, derived from PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k, produces consistent improvements on translated versions of MMBench, ScienceQA, MME, POPE, and AI2D across the target languages, while also delivering measurable positive transfer back to English benchmarks.
What carries the argument
The regeneration-translation paradigm that combines synthetic generation with permissively licensed models and manual annotation to produce high-quality cross-lingual multimodal training and evaluation data.
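The regeneration-translation pipeline can be sketched as a minimal data flow: regenerate each English example with a permissively licensed model, then translate the regenerated text into each target language while the visual input stays fixed. The sketch below is an illustration only; the model calls are stubbed out, and every name (`Example`, `regenerate`, `translate`) is hypothetical, not the paper's actual code.

```python
from dataclasses import dataclass

TARGET_LANGUAGES = ["fr", "de", "it", "es"]  # French, German, Italian, Spanish

@dataclass
class Example:
    image_id: str  # the visual input is shared across all languages
    text: str      # caption or question-answer text
    lang: str

def regenerate(example: Example) -> Example:
    # Stand-in for prompting a permissively licensed model to rewrite the
    # English text while keeping it grounded in the same image.
    return Example(example.image_id, f"[regenerated] {example.text}", "en")

def translate(example: Example, lang: str) -> Example:
    # Stand-in for translating the regenerated English text.
    return Example(example.image_id, f"[{lang}] {example.text}", lang)

def build_multilingual_corpus(english_examples: list[Example]) -> list[Example]:
    corpus = []
    for ex in english_examples:
        regen = regenerate(ex)
        corpus.append(regen)  # keep the regenerated English version too
        corpus.extend(translate(regen, lang) for lang in TARGET_LANGUAGES)
    return corpus

seed = [Example("img_001", "A red bicycle leaning against a wall.", "en")]
corpus = build_multilingual_corpus(seed)
# One regenerated English example plus four translations per seed example.
```

The key design property this captures is that each image appears with parallel text in all five languages, so any cross-language performance difference is attributable to the text side.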
If this is right
- Multilingual multimodal training data yields consistent gains on non-English VLM benchmarks.
- The same multilingual training produces positive transfer that raises English benchmark scores as well.
- The Multi-PixMo corpus and translated benchmark suite supply ready-to-use resources for further VLM work in the five languages.
- Ablation comparisons confirm the multilingual regime outperforms English-only training.
Where Pith is reading between the lines
- The regeneration method could scale to additional languages if similar permissively licensed models are available.
- Translated benchmarks may serve as a reliable proxy for measuring true multilingual capability when native test sets are scarce.
- The observed transfer to English suggests that exposure to varied language structures during training can strengthen core reasoning even in the dominant language.
Load-bearing premise
The data produced by regenerating examples and translating benchmarks preserves enough semantic fidelity and quality to function as effective training and evaluation material equivalent to native-language resources.
What would settle it
If models trained on the multilingual Multi-PixMo data and evaluated on the translated benchmarks show no improvement or a decline relative to English-only training on the non-English tasks, the central claim would be falsified.
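That falsification test amounts to a simple paired comparison: for each translated benchmark, take the score delta between the multilingual-trained and the English-only-trained model, and check whether every delta is positive. A minimal sketch; all scores below are invented placeholders for illustration, not the paper's reported numbers.

```python
def ablation_deltas(english_only: dict, multilingual: dict) -> dict:
    # Per-benchmark score difference: positive means multilingual training helped.
    return {k: round(multilingual[k] - english_only[k], 2) for k in english_only}

def claim_holds(deltas: dict) -> bool:
    # The central claim survives only if no non-English benchmark regresses.
    return all(d > 0 for d in deltas.values())

# Invented placeholder accuracies (%) for one model -- NOT the paper's results.
english_only = {"MMBench-fr": 55.0, "ScienceQA-de": 60.2, "POPE-it": 78.4}
multilingual = {"MMBench-fr": 58.3, "ScienceQA-de": 63.0, "POPE-it": 80.1}

deltas = ablation_deltas(english_only, multilingual)
```

Running the same check per model and per language, as the paper's three-model ablation does, guards against a single model or benchmark driving the conclusion.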
Original abstract
Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained regenerating examples from Pixmo pre-existing datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, with respect to English only, in VLMs training. Experiments, comprising 3 different models show that using multilingual, multimodal examples for training VLMs aids is consistently beneficial on non-English benchmarks, with positive transfer to English as well.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multi-PixMo, a multilingual multimodal training corpus for vision-language models spanning English, French, German, Italian, and Spanish, constructed via a regeneration-translation process from existing English PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, CoSyn-400k) using permissively licensed models. It also provides translated versions of standard VLM evaluation benchmarks (MMBench, ScienceQA, MME, POPE, AI2D). Quality is assessed via human qualitative/quantitative analyses with inter-annotator agreement metrics, and ablation experiments on three VLMs demonstrate that multilingual training yields consistent gains on non-English benchmarks alongside positive transfer to English.
Significance. If the regenerated data preserves semantic and visual fidelity comparable to native resources, the work supplies practical training and evaluation resources that directly address English-centric limitations in VLMs, along with empirical evidence from human evaluations and cross-model ablations supporting the value of multilingual multimodal data. The inclusion of inter-annotator agreement and ablation studies across three models strengthens the empirical foundation.
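Inter-annotator agreement is the main quantitative quality evidence the review cites. One statistic well suited to high-agreement annotation settings is Gwet's AC1; since the paper's exact agreement statistic is not specified here, the two-rater implementation below is an illustration of how such a number could be computed, not a claim about the paper's method.

```python
def gwet_ac1(rater_a: list, rater_b: list) -> float:
    """Gwet's AC1 chance-corrected agreement for two raters, categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q == 1:
        return 1.0  # only one category ever used: perfect agreement by definition
    # Observed agreement: fraction of items both raters labeled identically.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # pi_k: mean proportion of items each rater assigned to category k.
    pi = {k: (rater_a.count(k) + rater_b.count(k)) / (2 * n) for k in categories}
    # Chance agreement under AC1's "random rating of hard items" model.
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

Unlike Cohen's kappa, AC1 does not collapse toward zero when annotators agree on a heavily dominant category, which is the typical regime when checking translation fidelity.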
major comments (2)
- [Data construction] The central claim that multilingual training is beneficial rests on the assumption that regenerated examples maintain visual grounding and semantic equivalence to the original English data. However, the manuscript provides only high-level descriptions of the regeneration-translation paradigm without exact prompts, model versions, or quantitative metrics (e.g., grounding error rates or semantic similarity scores) that would allow verification that performance differences in the ablations arise from multilingualism rather than regeneration artifacts.
- [Ablation studies] The experiments compare English-only vs. multilingual training but lack a control condition using human-curated native multilingual data. This omission makes it difficult to isolate whether observed gains on non-English benchmarks (and English transfer) are due to cross-lingual transfer or to incidental differences in data volume, style, or noise introduced by the regeneration process.
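The semantic-fidelity audit the referee asks for could be automated as a first pass before manual review: score each (original, regenerated) pair and flag low-similarity pairs for human inspection. A realistic check would use multilingual sentence embeddings or a BERTScore-style metric; the bag-of-words cosine below is a deliberately simple stand-in, and both function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_bow(a: str, b: str) -> float:
    # Toy stand-in for embedding-based similarity: cosine over word counts.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_low_fidelity(pairs: list, threshold: float = 0.5) -> list:
    # pairs: (original, regenerated) text pairs; returns indices to audit manually.
    return [i for i, (orig, regen) in enumerate(pairs)
            if cosine_bow(orig, regen) < threshold]
```

Reporting the flagged fraction per language would directly address the referee's concern that regeneration artifacts, rather than multilingualism, drive the ablation differences.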
minor comments (2)
- [Abstract] The phrase 'VLMs aids is consistently beneficial' contains a grammatical error and should be revised for clarity (e.g., 'VLMs is consistently beneficial').
- The manuscript does not specify whether the constructed Multi-PixMo dataset and translated benchmarks will be publicly released, which is essential for a resource paper to enable full reproducibility and community use.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript.
Point-by-point responses
- Referee: [Data construction] The central claim that multilingual training is beneficial rests on the assumption that regenerated examples maintain visual grounding and semantic equivalence to the original English data. However, the manuscript provides only high-level descriptions of the regeneration-translation paradigm without exact prompts, model versions, or quantitative metrics (e.g., grounding error rates or semantic similarity scores) that would allow verification that performance differences in the ablations arise from multilingualism rather than regeneration artifacts.
Authors: We agree that more precise documentation of the regeneration-translation process is needed to support the central claims. In the revised manuscript, we will add the exact prompts used for regeneration and translation, the specific model versions and licensing details, and a clearer description of the pipeline. While we did not compute automatic metrics such as semantic similarity scores, the human qualitative and quantitative evaluations (including inter-annotator agreement) reported in the paper provide direct evidence of semantic fidelity and visual grounding preservation. We will also include a short discussion of why human evaluation was prioritized and how it addresses potential regeneration artifacts. revision: yes
- Referee: [Ablation studies] The experiments compare English-only vs. multilingual training but lack a control condition using human-curated native multilingual data. This omission makes it difficult to isolate whether observed gains on non-English benchmarks (and English transfer) are due to cross-lingual transfer or to incidental differences in data volume, style, or noise introduced by the regeneration process.
Authors: We acknowledge that a native human-curated multilingual control would allow stronger isolation of cross-lingual transfer effects. However, constructing such a dataset at the scale of Multi-PixMo is resource-intensive and lies outside the scope of this work, whose goal is to release accessible resources derived from existing English data via regeneration. The design preserves identical visual inputs across languages, and our human evaluations confirm high semantic equivalence. The consistent gains across three VLMs on non-English benchmarks, together with positive English transfer, provide supporting evidence for the value of multilingual training. In the revision we will add an explicit limitations section discussing the lack of native controls and the possibility of regeneration-induced differences in style or noise. revision: partial
Circularity Check
No circularity: empirical resource creation and ablation study
Full rationale
The paper is an empirical effort that constructs Multi-PixMo via regeneration-translation of existing PixMo datasets using permissively licensed models, translates English benchmarks, reports human qualitative/quantitative analyses with inter-annotator agreement, and runs ablation experiments across three models comparing English-only vs. multilingual training. No equations, fitted parameters, or derivations are presented as predictions. Central claims rest on observable benchmark performance differences rather than self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks and human evaluations.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Regenerated examples from permissively licensed models retain sufficient semantic and visual fidelity for effective VLM training across languages.
- Domain assumption: Machine-translated benchmarks preserve original meaning, difficulty, and visual-text alignment.