pith. machine review for the scientific record.

arxiv: 2604.11589 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

MLLM-as-a-Judge Exhibits Model Preference Bias

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords MLLM-as-a-Judge · preference bias · self-preference · automatic evaluation · multimodal models · model families · ensemble evaluation

The pith

MLLM judges exhibit self-preference bias toward their own outputs and those from related model families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Philautia-Eval to quantify how much MLLM-as-a-Judge methods favor text generated by specific models. It separates these preference tendencies from actual differences in generation quality across 1.29 million caption-score pairs from 12 models. Results show clear self-preference bias, plus mutual biases within model families that may arise from shared connectors and instruction-tuning data. A simple ensemble called Pomms reduces the bias while preserving evaluation performance.

Core claim

Representative MLLMs tend to exhibit self-preference bias when acting as judges, with mutual preference bias within particular model families potentially driven by reused connectors and overlapping instruction-tuning resources; these biases can be quantified via Philautia-Eval and mitigated by an ensemble of MLLMs.

What carries the argument

Philautia-Eval, a method that disentangles model preference tendencies from genuine differences in generation quality using large-scale paired evaluations.
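The summary does not spell out how the disentangling works. As a rough illustration only (the double-centering below is our assumption, not the authors' definition), one simple way to separate per-judge leniency and per-generator quality from a residual preference term is to two-way center the judge-by-generator score matrix; a positive diagonal in the residual then signals self-preference:

```python
import numpy as np

def philautia_matrix(scores: np.ndarray) -> np.ndarray:
    """Double-center a judge x generator score matrix.

    scores[j, g] is the mean score judge j assigns to captions from
    generator g. Subtracting row means (judge leniency) and column
    means (generator quality) leaves a residual whose diagonal
    reflects self-preference.
    """
    row = scores.mean(axis=1, keepdims=True)   # judge leniency
    col = scores.mean(axis=0, keepdims=True)   # generator quality
    grand = scores.mean()
    return scores - row - col + grand

# Toy example: judge 0 inflates its own generator's captions by +2.
S = np.array([[7.0, 5.0, 5.0],
              [5.0, 5.0, 5.0],
              [5.0, 5.0, 5.0]])
phi = philautia_matrix(S)
```

In this toy case the residual's largest diagonal entry belongs to the biased judge, which is the qualitative pattern Figure 4 reports for the actual models.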

If this is right

  • Single-MLLM judge benchmarks may systematically distort performance comparisons between models.
  • Model families sharing training components show correlated biases in automatic evaluations.
  • Ensemble judges like Pomms can serve as a practical way to reduce bias in evaluation pipelines.
  • Evaluation protocols relying on MLLM judges require explicit checks for model-specific preferences.
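The abstract describes Pomms only as "a simple ensemble of MLLMs." A hedged sketch of one plausible aggregation (the z-normalization step and the judge names are our assumption, not the paper's recipe): normalize each judge's scores to its own scale before averaging, so no single judge's inflated self-scores dominate.

```python
import statistics

def ensemble_score(per_judge_scores: dict[str, list[float]],
                   caption_idx: int) -> float:
    """Hypothetical Pomms-style aggregation: z-normalize each judge's
    scores so scale differences cancel, then average across judges
    for one caption."""
    zs = []
    for judge, scores in per_judge_scores.items():
        mu = statistics.mean(scores)
        sd = statistics.pstdev(scores) or 1.0   # guard constant scores
        zs.append((scores[caption_idx] - mu) / sd)
    return sum(zs) / len(zs)

scores = {
    "judge_a": [90.0, 60.0, 70.0],   # inflates caption 0 (its own)
    "judge_b": [70.0, 62.0, 68.0],
    "judge_c": [68.0, 64.0, 66.0],
}
pooled = ensemble_score(scores, 0)
```

Averaging across judges tempers judge_a's inflated view of its own caption relative to using judge_a alone, which is the intuition behind an ensemble mitigating self-preference.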

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analogous self-preference effects are likely present when using LLMs as judges in text-only settings.
  • Developers could reduce downstream bias by diversifying connectors and instruction data across models.
  • Extending Philautia-Eval to other modalities or tasks would test whether the bias pattern generalizes.

Load-bearing premise

Philautia-Eval successfully disentangles model preference tendencies from genuine differences in generation quality without introducing new artifacts.

What would settle it

An experiment where generation quality is first verified as equal by humans or independent metrics across models, then checking whether Philautia-Eval still detects preference biases.
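That settling experiment can be operationalized as a permutation test (a sketch under our own assumptions; the review mentions no specific test): on captions already verified as quality-matched, generator labels are exchangeable under the null, so shuffling them yields a null distribution for the own-versus-other score gap.

```python
import random

def self_pref_stat(pairs):
    """Gap between scores a judge gives its own generator vs. others.
    pairs: (judge, generator, score) triples on quality-matched captions."""
    own = [s for j, g, s in pairs if j == g]
    other = [s for j, g, s in pairs if j != g]
    if not own or not other:            # degenerate shuffle: gap undefined
        return float("-inf")
    return sum(own) / len(own) - sum(other) / len(other)

def permutation_pvalue(pairs, n_perm=2000, seed=0):
    """Under H0 (no self-preference once quality is matched), shuffling
    the generator labels should produce gaps as large as the observed
    one reasonably often."""
    rng = random.Random(seed)
    observed = self_pref_stat(pairs)
    gens = [g for _, g, _ in pairs]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(gens)
        shuffled = [(j, g, s) for (j, _, s), g in zip(pairs, gens)]
        if self_pref_stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: every judge scores every generator 5.0, plus an injected
# +1.5 self-preference bonus on the diagonal.
pairs = [(j, g, 5.0 + (1.5 if j == g else 0.0))
         for j in "abc" for g in "abc"]
p = permutation_pvalue(pairs)
```

If Philautia-Eval still flags a bias that survives this kind of label-shuffling null on quality-matched data, the measurement is unlikely to be an artifact of quality differences.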

Figures

Figures reproduced from arXiv: 2604.11589 by Daichi Yashima, Komei Sugiura, Shuitsu Koyama, Yuiga Wada.

Figure 1
Figure 1. Schematic of our approach for investigating model-specific preference [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. An example of self-preference bias in MLLM-as-a-Judge. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Visualization of Φ̃ in the (i) reference-based and (ii) reference-free settings. All philautia scores (diagonal items) were greater than zero, indicating the presence of self-preference bias within the MLLMs used in our experiments. view at source ↗
Figure 5
Figure 5. Example of self-preference bias. The bar chart shows the scores given to a caption generated by Gemini-2.5-Pro. Gemini-2.5-Pro exceptionally gave high scores to its own generations compared with the other Evaluators. The symbol ♦ represents the mean value of the scores by each Evaluator. Red text within ŷ_g highlights hallucination. view at source ↗
Figure 6
Figure 6. Visualization of preference bias within model families. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
read the original abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Philautia-Eval, a method to quantify model-specific preference bias in MLLM-as-a-Judge by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs from 12 MLLMs, it reports self-preference bias in representative models and mutual preference bias within model families, potentially attributable to reused connectors and overlapping instruction-tuning data. It further introduces Pomms, a simple ensemble of MLLMs that mitigates the measured bias while preserving evaluation performance.

Significance. If the disentangling procedure in Philautia-Eval is robust, the work identifies a practically important limitation in the growing use of MLLMs for automatic multimodal evaluation, which could otherwise distort model rankings and benchmark-driven research. The scale of the study (1.29M pairs across 12 models) and the proposed mitigation via ensemble provide concrete, actionable contributions. The findings on family-wise bias also open avenues for understanding training-data overlap effects in multimodal models.

major comments (3)
  1. [§3] §3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.
  2. [§4.2] §4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.
  3. [§5] §5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.
minor comments (2)
  1. [Abstract] Abstract and §2: The names 'Philautia-Eval' and 'Pomms' are introduced without expansion or motivation, which reduces immediate readability for readers unfamiliar with the Greek root or acronym.
  2. [Figure 2] Figure 2 or equivalent bias heatmap: Error bars or confidence intervals are missing from the per-model bias scores, making it difficult to judge the reliability of the reported differences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We provide point-by-point responses to the major comments below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Philautia-Eval): The central claim that the method successfully disentangles preference bias from genuine quality differences rests on an unspecified normalization or regression step. No explicit equations, pseudocode, or ablation on residual correlation with judge training data are provided, leaving open the possibility that measured self-preference is partly an artifact of shared generation/scoring pipelines.

    Authors: We agree that the disentangling procedure requires more explicit documentation. In the revised version, we will add the full set of equations describing the normalization and regression steps used in Philautia-Eval, include pseudocode for the algorithm, and perform an ablation analysis to check for residual correlations with the training data of the judge models. This will address concerns about potential artifacts from shared pipelines. revision: yes

  2. Referee: [§4.2] §4.2 (Results on 1.29M pairs): The reported self-preference and family-wise mutual bias figures lack accompanying statistical controls (e.g., permutation tests, multiple-comparison correction across 12 models, or independent quality oracle) that would confirm the bias is not driven by unaccounted confounders in caption generation.

    Authors: We thank the referee for this valuable suggestion. We will enhance §4.2 by adding permutation tests to validate the significance of the bias measurements and apply appropriate multiple-comparison corrections for the 12 models. While we do not have an independent quality oracle in the current study, the large scale of the 1.29M caption-score pairs helps control for confounders; we will explicitly discuss this in the revision and note it as a limitation. revision: partial

  3. Referee: [§5] §5 (Causal interpretation): The statement that mutual bias is 'potentially driven by reused connectors and overlapping instruction-tuning resources' is presented without any supporting analysis (data-overlap metrics, connector ablation, or controlled fine-tuning experiments), weakening the explanatory claim even if the bias measurement itself holds.

    Authors: We recognize that the explanatory claim is not supported by direct analysis. In the revision, we will modify the language in §5 to present this as a hypothesis rather than a firm attribution, and we will include a discussion on how future work could use data-overlap metrics or ablations to investigate this. The core bias measurements remain valid independently of this interpretation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical bias measurement via new disentangling method on collected data

full rationale

The paper proposes Philautia-Eval as a new framework to quantify model-specific preference bias by disentangling it from generation quality differences, then applies it to an independently collected dataset of 1.29M caption-score pairs across 12 MLLMs. The self-preference and family-wise mutual bias findings are presented as direct experimental observations from this evaluation, with an additional ensemble method (Pomms) introduced to mitigate observed bias. No equations, fitted parameters, or self-citations are described that would reduce the bias quantification or central claims to tautological inputs by construction. The derivation chain consists of data collection followed by application of the proposed disentangling procedure, which is external to the measured outputs and does not invoke prior author work as a uniqueness theorem or ansatz. This is a standard empirical study without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new named constructs (Philautia-Eval and Pomms) whose validity rests on the unstated details of the disentangling procedure. No free parameters or mathematical axioms are mentioned in the abstract.

invented entities (2)
  • Philautia-Eval no independent evidence
    purpose: Quantify model-specific preference bias by disentangling preference from generation quality
    New method proposed in the paper
  • Pomms no independent evidence
    purpose: Ensemble of MLLMs to mitigate model preference bias
    New mitigation approach introduced in the paper

pith-pipeline@v0.9.0 · 5499 in / 1122 out tokens · 41695 ms · 2026-05-10T15:38:45.991571+00:00 · methodology

discussion (0)


    Assign an integer score from 0 to 100, please remember it. Generated captions: {{Caption}} Response Format: You should first give detailed reason for your score, and ending with sentence like this: The final score is ${{score}}$. Note that the score should be an integer from 0 to 100, and should be wrapped in the dollar signs ($)