Multilingual Training and Evaluation Resources for Vision-Language Models
Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3
The pith
Training vision-language models on multilingual multimodal examples improves non-English benchmark performance with positive transfer to English.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training VLMs on the regenerated multilingual multimodal examples in Multi-PixMo, derived from PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k, produces consistent improvements on translated versions of MMBench, ScienceQA, MME, POPE, and AI2D across the target languages, while also delivering measurable positive transfer back to English benchmarks.
What carries the argument
The regeneration-translation paradigm that combines synthetic generation with permissively licensed models and manual annotation to produce high-quality cross-lingual multimodal training and evaluation data.
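The regeneration-translation pipeline can be sketched as a minimal data flow: regenerate each English example with a permissively licensed model, then translate the regenerated text into each target language while the visual input stays fixed. The sketch below is an illustration only; the model calls are stubbed out, and every name (`Example`, `regenerate`, `translate`) is hypothetical, not the paper's actual code.

```python
from dataclasses import dataclass

TARGET_LANGUAGES = ["fr", "de", "it", "es"]  # French, German, Italian, Spanish

@dataclass
class Example:
    image_id: str  # the visual input is shared across all languages
    text: str      # caption or question-answer text
    lang: str

def regenerate(example: Example) -> Example:
    # Stand-in for prompting a permissively licensed model to rewrite the
    # English text while keeping it grounded in the same image.
    return Example(example.image_id, f"[regenerated] {example.text}", "en")

def translate(example: Example, lang: str) -> Example:
    # Stand-in for translating the regenerated English text.
    return Example(example.image_id, f"[{lang}] {example.text}", lang)

def build_multilingual_corpus(english_examples: list[Example]) -> list[Example]:
    corpus = []
    for ex in english_examples:
        regen = regenerate(ex)
        corpus.append(regen)  # keep the regenerated English version too
        corpus.extend(translate(regen, lang) for lang in TARGET_LANGUAGES)
    return corpus

seed = [Example("img_001", "A red bicycle leaning against a wall.", "en")]
corpus = build_multilingual_corpus(seed)
# One regenerated English example plus four translations per seed example.
```

The key design property this captures is that each image appears with parallel text in all five languages, so any cross-language performance difference is attributable to the text side.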
If this is right
- Multilingual multimodal training data yields consistent gains on non-English VLM benchmarks.
- The same multilingual training produces positive transfer that raises English benchmark scores as well.
- The Multi-PixMo corpus and translated benchmark suite supply ready-to-use resources for further VLM work in the five languages.
- Ablation comparisons confirm the multilingual regime outperforms English-only training.
Where Pith is reading between the lines
- The regeneration method could scale to additional languages if similar permissively licensed models are available.
- Translated benchmarks may serve as a reliable proxy for measuring true multilingual capability when native test sets are scarce.
- The observed transfer to English suggests that exposure to varied language structures during training can strengthen core reasoning even in the dominant language.
Load-bearing premise
The data produced by regenerating examples and translating benchmarks preserves enough semantic fidelity and quality to function as effective training and evaluation material equivalent to native-language resources.
What would settle it
If models trained on the multilingual Multi-PixMo data and evaluated on the translated benchmarks show no improvement or a decline relative to English-only training on the non-English tasks, the central claim would be falsified.
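That falsification test amounts to a simple paired comparison: for each translated benchmark, take the score delta between the multilingual-trained and the English-only-trained model, and check whether every delta is positive. A minimal sketch; all scores below are invented placeholders for illustration, not the paper's reported numbers.

```python
def ablation_deltas(english_only: dict, multilingual: dict) -> dict:
    # Per-benchmark score difference: positive means multilingual training helped.
    return {k: round(multilingual[k] - english_only[k], 2) for k in english_only}

def claim_holds(deltas: dict) -> bool:
    # The central claim survives only if no non-English benchmark regresses.
    return all(d > 0 for d in deltas.values())

# Invented placeholder accuracies (%) for one model -- NOT the paper's results.
english_only = {"MMBench-fr": 55.0, "ScienceQA-de": 60.2, "POPE-it": 78.4}
multilingual = {"MMBench-fr": 58.3, "ScienceQA-de": 63.0, "POPE-it": 80.1}

deltas = ablation_deltas(english_only, multilingual)
```

Running the same check per model and per language, as the paper's three-model ablation does, guards against a single model or benchmark driving the conclusion.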
Original abstract
Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained regenerating examples from Pixmo pre-existing datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, with respect to English only, in VLMs training. Experiments, comprising 3 different models show that using multilingual, multimodal examples for training VLMs aids is consistently beneficial on non-English benchmarks, with positive transfer to English as well.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multi-PixMo, a multilingual multimodal training corpus for vision-language models spanning English, French, German, Italian, and Spanish, constructed via a regeneration-translation process from existing English PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, CoSyn-400k) using permissively licensed models. It also provides translated versions of standard VLM evaluation benchmarks (MMBench, ScienceQA, MME, POPE, AI2D). Quality is assessed via human qualitative/quantitative analyses with inter-annotator agreement metrics, and ablation experiments on three VLMs demonstrate that multilingual training yields consistent gains on non-English benchmarks alongside positive transfer to English.
Significance. If the regenerated data preserves semantic and visual fidelity comparable to native resources, the work supplies practical training and evaluation resources that directly address English-centric limitations in VLMs, along with empirical evidence from human evaluations and cross-model ablations supporting the value of multilingual multimodal data. The inclusion of inter-annotator agreement and ablation studies across three models strengthens the empirical foundation.
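Inter-annotator agreement is the main quantitative quality evidence the review cites. One statistic well suited to high-agreement annotation settings is Gwet's AC1; since the paper's exact agreement statistic is not specified here, the two-rater implementation below is an illustration of how such a number could be computed, not a claim about the paper's method.

```python
def gwet_ac1(rater_a: list, rater_b: list) -> float:
    """Gwet's AC1 chance-corrected agreement for two raters, categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q == 1:
        return 1.0  # only one category ever used: perfect agreement by definition
    # Observed agreement: fraction of items both raters labeled identically.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # pi_k: mean proportion of items each rater assigned to category k.
    pi = {k: (rater_a.count(k) + rater_b.count(k)) / (2 * n) for k in categories}
    # Chance agreement under AC1's "random rating of hard items" model.
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

Unlike Cohen's kappa, AC1 does not collapse toward zero when annotators agree on a heavily dominant category, which is the typical regime when checking translation fidelity.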
major comments (2)
- [Data construction] The central claim that multilingual training is beneficial rests on the assumption that regenerated examples maintain visual grounding and semantic equivalence to the original English data. However, the manuscript provides only high-level descriptions of the regeneration-translation paradigm without exact prompts, model versions, or quantitative metrics (e.g., grounding error rates or semantic similarity scores) that would allow verification that performance differences in the ablations arise from multilingualism rather than regeneration artifacts.
- [Ablation studies] The experiments compare English-only vs. multilingual training but lack a control condition using human-curated native multilingual data. This omission makes it difficult to isolate whether observed gains on non-English benchmarks (and English transfer) are due to cross-lingual transfer or to incidental differences in data volume, style, or noise introduced by the regeneration process.
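The semantic-fidelity audit the referee asks for could be automated as a first pass before manual review: score each (original, regenerated) pair and flag low-similarity pairs for human inspection. A realistic check would use multilingual sentence embeddings or a BERTScore-style metric; the bag-of-words cosine below is a deliberately simple stand-in, and both function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_bow(a: str, b: str) -> float:
    # Toy stand-in for embedding-based similarity: cosine over word counts.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_low_fidelity(pairs: list, threshold: float = 0.5) -> list:
    # pairs: (original, regenerated) text pairs; returns indices to audit manually.
    return [i for i, (orig, regen) in enumerate(pairs)
            if cosine_bow(orig, regen) < threshold]
```

Reporting the flagged fraction per language would directly address the referee's concern that regeneration artifacts, rather than multilingualism, drive the ablation differences.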
minor comments (2)
- [Abstract] The phrase 'VLMs aids is consistently beneficial' contains a grammatical error and should be revised for clarity (e.g., 'VLMs is consistently beneficial').
- The manuscript does not specify whether the constructed Multi-PixMo dataset and translated benchmarks will be publicly released, which is essential for a resource paper to enable full reproducibility and community use.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript.
Point-by-point responses
- Referee: [Data construction] The central claim that multilingual training is beneficial rests on the assumption that regenerated examples maintain visual grounding and semantic equivalence to the original English data. However, the manuscript provides only high-level descriptions of the regeneration-translation paradigm without exact prompts, model versions, or quantitative metrics (e.g., grounding error rates or semantic similarity scores) that would allow verification that performance differences in the ablations arise from multilingualism rather than regeneration artifacts.
Authors: We agree that more precise documentation of the regeneration-translation process is needed to support the central claims. In the revised manuscript, we will add the exact prompts used for regeneration and translation, the specific model versions and licensing details, and a clearer description of the pipeline. While we did not compute automatic metrics such as semantic similarity scores, the human qualitative and quantitative evaluations (including inter-annotator agreement) reported in the paper provide direct evidence of semantic fidelity and visual grounding preservation. We will also include a short discussion of why human evaluation was prioritized and how it addresses potential regeneration artifacts. revision: yes
- Referee: [Ablation studies] The experiments compare English-only vs. multilingual training but lack a control condition using human-curated native multilingual data. This omission makes it difficult to isolate whether observed gains on non-English benchmarks (and English transfer) are due to cross-lingual transfer or to incidental differences in data volume, style, or noise introduced by the regeneration process.
Authors: We acknowledge that a native human-curated multilingual control would allow stronger isolation of cross-lingual transfer effects. However, constructing such a dataset at the scale of Multi-PixMo is resource-intensive and lies outside the scope of this work, whose goal is to release accessible resources derived from existing English data via regeneration. The design preserves identical visual inputs across languages, and our human evaluations confirm high semantic equivalence. The consistent gains across three VLMs on non-English benchmarks, together with positive English transfer, provide supporting evidence for the value of multilingual training. In the revision we will add an explicit limitations section discussing the lack of native controls and the possibility of regeneration-induced differences in style or noise. revision: partial
Circularity Check
No circularity: empirical resource creation and ablation study
Full rationale
The paper is an empirical effort that constructs Multi-PixMo via regeneration-translation of existing PixMo datasets using permissively licensed models, translates English benchmarks, reports human qualitative/quantitative analyses with inter-annotator agreement, and runs ablation experiments across three models comparing English-only vs. multilingual training. No equations, fitted parameters, or derivations are presented as predictions. Central claims rest on observable benchmark performance differences rather than self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks and human evaluations.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Regenerated examples from permissively licensed models retain sufficient semantic and visual fidelity for effective VLM training across languages.
- Domain assumption: Machine-translated benchmarks preserve original meaning, difficulty, and visual-text alignment.