Lost in Translation: Do LVLM Judges Generalize Across Languages?
Pith reviewed 2026-05-10 02:14 UTC · model grok-4.3
The pith
LVLM judges exhibit inconsistent performance across 25 languages, with model size and architecture proving poor predictors of robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By releasing MM-JudgeBench with two complementary subsets—one extending VL-RewardBench for general vision-language preferences and one derived from OpenCQA for chart-centric reasoning—the authors demonstrate that LVLM judges display substantial performance variance across 25 languages. Their analysis of 15 open-source and 7 proprietary models shows that neither larger scale nor specific architectures reliably improve multilingual consistency, and that even leading judges produce inconsistent preference decisions depending on the input language.
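The claim that judges flip preferences depending on input language can be made concrete with a small consistency measure. The sketch below is illustrative (the names and data are hypothetical, not from the paper): for each non-English language it computes the fraction of items on which the judge's verdict agrees with its own English-language verdict for the same image-text pair.

```python
# Hypothetical sketch of a cross-lingual consistency check for an LVLM judge.
# verdicts[lang][item_id] holds the judge's preferred response ("A" or "B")
# for the same underlying image-text pair rendered in each language.

def consistency_rate(verdicts: dict[str, dict[str, str]], ref_lang: str = "en") -> dict[str, float]:
    """Fraction of shared items where a language's verdict matches the reference language."""
    ref = verdicts[ref_lang]
    rates = {}
    for lang, preds in verdicts.items():
        if lang == ref_lang:
            continue
        shared = [i for i in preds if i in ref]
        rates[lang] = sum(preds[i] == ref[i] for i in shared) / len(shared)
    return rates

verdicts = {
    "en": {"q1": "A", "q2": "B", "q3": "A"},
    "de": {"q1": "A", "q2": "B", "q3": "B"},
    "sw": {"q1": "B", "q2": "B", "q3": "B"},
}
print(consistency_rate(verdicts))  # de agrees on 2/3 of items, sw on 1/3
```

A perfectly language-robust judge would score 1.0 everywhere; rates well below 1.0 for some languages are exactly the inconsistency the authors report.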
What carries the argument
MM-JudgeBench, the benchmark of over 60K pairwise preference instances across 25 languages formed by extending existing English vision-language and chart-reasoning datasets.
If this is right
- Reward models for LVLMs require explicit multilingual testing rather than English-only validation to ensure reliable use.
- Current state-of-the-art LVLM judges cannot be deployed uniformly for non-English alignment or evaluation tasks without additional adaptation.
- The released multilingual training split from MM-RewardBench offers a starting point for domain adaptation to reduce cross-lingual variance.
- Architectural scaling alone will not resolve inconsistent behavior in multilingual settings.
Where Pith is reading between the lines
- The same generalization gaps likely appear in related multilingual alignment tasks such as safety and instruction following for LVLMs.
- Future benchmarks could incorporate direct native-speaker creation of preference pairs instead of translation to isolate true cross-lingual effects.
- The variance across languages points to imbalances in the visual-text pretraining data that reward models inherit.
Load-bearing premise
The benchmark's translated and extended preference instances accurately reflect native speaker judgments without introducing significant translation artifacts or cultural biases.
What would settle it
A direct comparison in which native speakers of several benchmark languages re-annotate the same image-text pairs would settle it: low agreement with the existing preference labels would indicate that the observed inconsistencies arise from benchmark construction rather than model behavior.
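Such a re-annotation study would typically be summarized with a chance-corrected agreement statistic. The sketch below (labels and values are hypothetical, not from the paper) computes Cohen's kappa between the original translated-benchmark labels and a native-speaker re-annotation; a kappa near zero would point at benchmark construction, a kappa near one at genuine model behavior.

```python
# Illustrative sketch: Cohen's kappa between original preference labels and a
# hypothetical native-speaker re-annotation of the same image-text pairs.

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotations of the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

original = ["A", "A", "B", "B", "A", "B"]  # labels carried over from English
native   = ["A", "B", "B", "B", "A", "A"]  # hypothetical native re-annotation
print(round(cohens_kappa(original, native), 3))  # → 0.333, i.e. weak agreement
```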
Original abstract
Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MM-JudgeBench, the first large-scale benchmark for multilingual multimodal LVLM judge evaluation, comprising over 60K pairwise preference instances across 25 typologically diverse languages. It integrates a general vision-language preference subset extending VL-RewardBench and a chart-centric visual-text reasoning subset derived from OpenCQA. The authors evaluate 22 LVLMs (15 open-source, 7 proprietary), report substantial cross-lingual performance variance, conclude that model size and architecture are poor predictors of multilingual robustness, and note inconsistent behavior even in state-of-the-art judges. They additionally release a disjoint multilingual training set derived from MM-RewardBench to support domain adaptation.
Significance. If the benchmark labels prove reliable, this work is significant for demonstrating fundamental limitations in the cross-lingual generalization of LVLM-based reward models, which are central to alignment pipelines. The scale (60K+ instances, 25 languages), complementary subsets, and public release of both the evaluation benchmark and training data are clear strengths that enable reproducible research and targeted improvements in multilingual evaluator development.
Major comments (2)
- §3 (Benchmark Construction): The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.
- §4 (Experimental Results): The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.
Minor comments (2)
- Abstract: The exact total instance count and per-subset or per-language breakdown would improve transparency over the approximate 'over 60K' figure.
- Introduction: Early definition of the term 'LVLM judges' and explicit expansion of acronyms such as VL-RewardBench on first use would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, indicating the changes we will make in the revised manuscript.
Point-by-point responses
Referee: §3 (Benchmark Construction): The extension of English-only sources (VL-RewardBench and OpenCQA) to 25 languages is described as adding translated text for the same visual inputs, but no details are provided on whether preference labels were re-annotated by native speakers, back-translated for fidelity, or validated against cultural/lexical artifacts. This is load-bearing for the central claim of model inconsistency across languages, because translation-induced shifts in preference ordering would produce apparent variance that reflects label noise rather than LVLM limitations.
Authors: We appreciate the referee highlighting the need for greater transparency in benchmark construction. The manuscript describes extending the English sources by translating the textual elements while retaining the identical visual inputs and the original English preference labels. We did not re-annotate the labels with native speakers for the 25 languages, as the goal was to evaluate judge behavior on equivalent preference judgments across languages. We acknowledge that this design choice leaves open the possibility of translation-induced artifacts. In the revision we will expand §3 with: (i) the exact translation pipeline and service employed, (ii) results of back-translation fidelity checks performed on a sampled subset, and (iii) an explicit limitations paragraph discussing potential cultural or lexical shifts. These additions will allow readers to evaluate whether the reported cross-lingual variance is attributable to label noise. Revision: yes.
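A back-translation fidelity check of the kind promised here can be sketched as follows. The paper does not specify its procedure, so this is a hypothetical, dependency-free stand-in: real pipelines would score the round trip with chrF, COMET, or a similar MT metric, while `difflib.SequenceMatcher` serves only as a crude character-level proxy.

```python
import difflib

# Crude fidelity screen (illustrative only): compare the original English text
# with its back-translation after a round trip through the target language.

def backtranslation_fidelity(original: str, back_translated: str) -> float:
    """Character-level similarity ratio in [0, 1]; higher means a more faithful round trip."""
    return difflib.SequenceMatcher(None, original.lower(), back_translated.lower()).ratio()

src = "The chart shows a steady rise in exports after 2015."
rtt = "The chart shows a steady increase in exports after 2015."
print(f"{backtranslation_fidelity(src, rtt):.2f}")  # a high ratio suggests content survived
```

Instances falling below a chosen threshold would be flagged for manual review or retranslation before entering the benchmark.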
Referee: §4 (Experimental Results): The analysis that model size and architecture are poor predictors of multilingual robustness lacks any reported statistical support such as correlation coefficients, regression results, or ablation studies. In addition, the abstract and results summary provide no error bars, confidence intervals, or significance tests for the cross-lingual performance differences, weakening the evidence for the reported 'substantial variance' and 'inconsistent behavior'.
Authors: We agree that formal statistical support would strengthen the claims. The current manuscript demonstrates variance through per-language performance tables and figures, but does not include correlation analyses or error bars. In the revised version we will add to §4: Pearson and Spearman correlations between model parameter count and cross-lingual performance standard deviation; analogous breakdowns by architecture family; standard-error bars on all relevant plots; and paired significance tests (Wilcoxon signed-rank) between languages for representative models. The abstract will be updated to reference these quantitative supports for the variance and inconsistency findings. Revision: yes.
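The proposed size-vs-robustness correlation can be sketched in a few lines. The model sizes and per-language accuracy spreads below are illustrative placeholders, not numbers from the paper; a Spearman rho near zero on the real data would support the claim that parameter count is a poor predictor of multilingual robustness.

```python
# Sketch of the proposed analysis: rank-correlate model parameter count with
# the standard deviation of per-language accuracy (all values illustrative).

def rank(values: list[float]) -> list[float]:
    """1-based ranks, averaging over tie groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

params_b = [7, 13, 34, 72, 110]       # model size in billions (illustrative)
acc_std  = [4.1, 5.3, 3.9, 5.0, 4.4]  # per-language accuracy std (illustrative)
print(round(spearman_rho(params_b, acc_std), 3))  # → 0.1, i.e. near-zero correlation
```

In practice one would use `scipy.stats.spearmanr`, which also returns a p-value; the hand-rolled version here just keeps the sketch self-contained.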
Circularity Check
No circularity: purely empirical benchmark construction and model evaluation
Full rationale
The paper introduces MM-JudgeBench by extending VL-RewardBench and OpenCQA to 25 languages via translation and evaluates 22 LVLMs on over 60K preference instances. All central claims (cross-lingual variance, poor predictive power of size/architecture) rest on new data collection and direct performance measurements rather than any derivation, equation, fitted parameter renamed as prediction, or self-citation chain. No mathematical steps, ansatzes, or uniqueness theorems are invoked; the work is self-contained against external benchmarks and does not reduce any result to its inputs by construction.