Recognition: no theorem link
Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Pith reviewed 2026-05-15 12:31 UTC · model grok-4.3
The pith
Jointly training multimodal LLM judges on multiple tasks with reinforcement learning improves consistency and human correlation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MT-RL-Judge jointly optimizes an MLLM-as-a-Judge across multiple tasks by reinforcement learning, producing higher judgment consistency and stronger correlation with human preferences than single-task baselines while also generalizing robustly to out-of-distribution tasks.
What carries the argument
The MT-RL-Judge framework, which applies multi-task reinforcement learning to jointly optimize a single multimodal LLM judge across diverse visual evaluation tasks.
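The abstract does not specify the RL recipe beyond "multi-task reinforcement learning." One plausible reading, given that the paper cites DeepSeekMath [9], is a GRPO-style objective applied to batches that interleave tasks. The sketch below is an assumption, not the paper's method: the task names, the reward stub, and the sampling scheme are all illustrative.

```python
import random

# Hypothetical sketch of one multi-task RL step with GRPO-style
# group-normalized advantages. Task names and the reward stub are
# illustrative assumptions, not details from the paper.

TASKS = ["image_captioning_judge", "vqa_judge", "text2image_judge"]

def reward(task, judgment):
    # Stand-in reward: 1.0 if the sampled judgment matches the
    # task's reference verdict, else 0.0.
    return 1.0 if judgment == f"{task}:correct" else 0.0

def group_advantages(rewards):
    # GRPO: normalize rewards within the group of samples drawn
    # for the same prompt (zero mean, roughly unit scale).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid divide-by-zero when all rewards tie
    return [(r - mean) / std for r in rewards]

def multi_task_step(rng, group_size=4):
    # One optimization step: draw a prompt from a randomly chosen task,
    # so gradient updates interleave tasks instead of training
    # separate per-task judges.
    task = rng.choice(TASKS)
    samples = [
        f"{task}:correct" if rng.random() < 0.5 else f"{task}:wrong"
        for _ in range(group_size)
    ]
    rewards = [reward(task, s) for s in samples]
    return task, group_advantages(rewards)

rng = random.Random(0)
task, advs = multi_task_step(rng)
print(task, [round(a, 2) for a in advs])
```

The point of the sketch is the sampling structure, not the reward: if tasks are mixed at the prompt level, every gradient step sees the full task distribution, which is the mechanism the review credits for cross-task consistency.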
If this is right
- Judge models become more consistent across varied visual tasks without separate per-task training.
- Correlation with human preferences rises relative to existing single-task baselines.
- Performance holds up on tasks outside the training distribution.
- A single trained judge can replace multiple task-specific models for evaluation pipelines.
Where Pith is reading between the lines
- The same multi-task RL pattern could be tested on text-only LLM judges to see whether task diversity helps there too.
- If more tasks are added during training, generalization might continue to improve up to some saturation point.
- The approach suggests that RL-based judges could eventually serve as drop-in evaluators for entirely new multimodal domains with little additional adaptation.
Load-bearing premise
That training on the chosen combination of tasks with reinforcement learning will create positive transfer and avoid negative interference or overfitting to any single task.
What would settle it
A new benchmark of out-of-distribution visual judgment tasks where the MT-RL-Judge model shows lower human preference correlation than a comparable single-task trained judge.
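The decisive test above hinges on "human preference correlation." In judge-evaluation work this is typically a rank correlation between model scores and human ratings; a minimal Spearman implementation (the choice of coefficient is an assumption here, since the abstract does not name one) looks like:

```python
def ranks(xs):
    # Fractional ranking: tied values receive their average rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(judge_scores, human_scores):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(judge_scores), ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy check: perfectly monotone judge scores correlate at 1.0.
print(spearman([0.1, 0.4, 0.9], [2, 3, 5]))  # → 1.0
```

The falsifier in this section would then be concrete: on a held-out task suite, `spearman(mt_rl_judge_scores, human_ratings)` coming out below the single-task baseline's value.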
read the original abstract
Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results against several strong baselines demonstrate that MT-RL-Judge outperforms strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MT-RL-Judge, a multi-task reinforcement learning framework that jointly optimizes multimodal LLMs as judges across multiple tasks. It claims that this yields better judgment consistency and higher correlation with human preferences than single-task RL baselines, along with robust generalization to out-of-distribution tasks.
Significance. If the reported gains hold under the described experimental conditions, the work would demonstrate that multi-task RL can produce positive transfer for MLLM judges without measurable negative interference, offering a practical route to more reliable automated evaluation across visual tasks.
minor comments (3)
- [Abstract] The summary of results would be strengthened by including at least one concrete metric (e.g., the reported gain in human correlation or consistency score) rather than qualitative statements alone.
- [Method] The description of the multi-task reward formulation should explicitly state how task-specific rewards are combined or balanced during joint optimization.
- [Experiments] Table or figure captions for the OOD generalization results should list the exact held-out tasks and the number of evaluation samples per task.
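The second comment touches a real design risk: naively summing task rewards lets a task with a wider reward range dominate the joint gradient. One standard mitigation, shown below as an assumption about what a reasonable implementation might do (the paper's actual balancing scheme is exactly what the comment asks it to state), is per-task running normalization of rewards before mixing:

```python
# Illustrative task-reward balancing: per-task running z-normalization
# (Welford's online algorithm) so tasks with different reward scales
# contribute comparably to a joint RL objective. Not from the paper.
from collections import defaultdict

class TaskRewardNormalizer:
    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations

    def update(self, task, r):
        # Welford update of per-task mean and variance accumulator.
        self.count[task] += 1
        d = r - self.mean[task]
        self.mean[task] += d / self.count[task]
        self.m2[task] += d * (r - self.mean[task])

    def normalize(self, task, r):
        if self.count[task] < 2:
            return 0.0  # no scale estimate yet for this task
        std = (self.m2[task] / (self.count[task] - 1)) ** 0.5
        return (r - self.mean[task]) / (std or 1.0)

norm = TaskRewardNormalizer()
for r in [0.0, 1.0, 0.5]:    # small-range task
    norm.update("vqa_judge", r)
for r in [0.0, 10.0, 5.0]:   # large-range task
    norm.update("caption_judge", r)
print(norm.normalize("vqa_judge", 1.0),
      norm.normalize("caption_judge", 10.0))  # → 1.0 1.0
```

After normalization the two tasks' top rewards map to the same value, so neither dominates the combined objective despite a 10x difference in raw scale.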
Simulated Author's Rebuttal
We thank the referee for their positive summary of MT-RL-Judge, the assessment of its significance, and the recommendation for minor revision. No major comments were provided in the report.
Circularity Check
No significant circularity identified
full rationale
The paper is entirely empirical, presenting MT-RL-Judge as a multi-task RL framework evaluated via direct experimental comparisons to single-task baselines and held-out OOD tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance gains in consistency and human correlation are reported from ablation studies and generalization metrics, which remain externally falsifiable and independent of any self-referential construction.
Reference graph
Works this paper leans on
- [1] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing multimodal LLM-as-a-Judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.
- [2] Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. MME-Survey: A comprehensive survey on evaluation of multimodal LLMs. arXiv preprint arXiv:2411.15296, 2024.
- [3] Juntao Gu, Xinran Zhao, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.
- [4] Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, and Lidong Bing. UniME-V2: MLLM-as-a-judge for universal multimodal embedding learning. arXiv preprint arXiv:2510.13515, 2025.
- [5] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.
- [6] Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, et al. MLLM as a UI judge: Benchmarking multimodal LLMs for predicting human perception of user interfaces. arXiv preprint arXiv:2510.08783, 2025.
- [7] Renjie Pi, Haoping Bai, Qibin Chen, Xiaoming Simon Wang, Jiulong Shan, Xiaojiang Liu, and Meng Cao. Mr. Judge: Multimodal reasoner as a judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20192–20216, 2025.
- [8] Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, and Yang Zhang. UnsafeBench: Benchmarking image safety classifiers on real-world and AI-generated images. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 3221–3235, 2025.
- [9] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [10] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.
- [11] Yue Zhou, Yi Chang, and Yuan Wu. ConfProBench: A confidence evaluation benchmark for MLLM-based process judges. arXiv preprint arXiv:2508.04576, 2025.
- [12] Appendix A, Training Configurations: For the SFT stage, we utilize the LLaMA-Factory framework (Zheng et al., 2024), while for Reinforcement Learning (RL), we employ EasyR1 (Yaowei Zheng, 2025). All MLLM-as-a-Judge models in our experiments are initialized from the Qwen/Qwen3-VL-30B-A3B-Instruct base model. Specifically, SFT is performed via full-parameter ...