
arxiv: 2606.29689 · v1 · pith:YZM73YMSnew · submitted 2026-06-29 · 💻 cs.CL

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

Pith reviewed 2026-06-30 06:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal large language models · aesthetic critique · reference-based evaluation · human alignment · open-ended generation · image grounding control

The pith

Reference-based similarity metrics overrate multimodal models' aesthetic critiques by favoring comprehensive style over human selectivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether five open-weight multimodal large language models can generate open-ended aesthetic critiques that align with human ones on the Reddit Photo Critique Dataset. It applies multiple reference-based similarity metrics under varied prompt conditions and introduces a control that supplies each model with the wrong photograph. Stricter lexical and learned metrics reveal only weak alignment with ranked human references, while coarser embedding measures report broad overlap that the control traces to the models' stable writing style rather than image-specific observation. The models also diverge behaviorally by writing longer critiques, covering nearly every aesthetic aspect uniformly, and repeating content where humans are selective and varied. This indicates that current similarity-based evaluations reward fluent comprehensiveness instead of the qualities that define human critique.

Core claim

When scored against ranked human critiques with reference-based metrics, multimodal models show only superficial alignment: stricter lexical and learned metrics report weak agreement, while a coarse embedding cosine reports broad topical overlap. A control experiment that feeds each model the wrong photograph shows that this apparent overlap stems from a stable, model-specific house style rather than observation of the given image. Behaviorally, the models write critiques two to three times longer than humans even under a length cap, address nearly all aesthetic aspects uniformly instead of being selective, and repeat themselves across critiques of the same photo where humans vary.

What carries the argument

The wrong-photograph grounding control that isolates whether topical similarity arises from image-specific reasoning or from the model's default critique style.
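One way to read such a control is as a gap between two scores: similarity to the human references when the model saw the correct photo, minus similarity when it saw the wrong one. The sketch below is illustrative only. The bag-of-words cosine is a toy stand-in for the paper's embedding metric, and the names `bow_cosine` and `grounding_gap` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy stand-in metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def grounding_gap(references, critique_correct, critique_wrong, sim=bow_cosine):
    """Mean similarity to the references with the correct image minus the wrong image.

    A gap near zero suggests the score does not depend on which photo the model saw.
    """
    def mean_sim(critique):
        return sum(sim(critique, ref) for ref in references) / len(references)

    return mean_sim(critique_correct) - mean_sim(critique_wrong)


refs = ["the horizon is tilted and the sky is overexposed"]
gap = grounding_gap(refs, "the horizon is tilted", "nice composition and good lighting overall")
print(round(gap, 3))
```

Under this reading, a metric whose gap stays close to zero across the dataset is mostly measuring style rather than image-specific content.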

If this is right

  • Reference-based similarity metrics reward fluent and comprehensive critique styles rather than the selectivity and specificity of human critiques.
  • Multimodal models produce critiques two to three times longer than humans even under length caps and cover nearly every aesthetic aspect where humans remain selective.
  • Models engage each aspect more uniformly and at greater depth while repeating themselves across critiques of the same photo where humans vary.
  • Current evaluation practices for open-ended multimodal generation require methods beyond reference similarity to capture human-like qualities.
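The behavioral differences listed above (length, aspect coverage, repetition) can each be expressed as a simple statistic. The following is a rough sketch under stated assumptions: the keyword lexicon `ASPECT_KEYWORDS` is invented for illustration and is not the paper's aspect taxonomy, and token-level Jaccard overlap is only one possible proxy for repetition.

```python
from itertools import combinations

# Hypothetical aspect lexicon, for illustration only.
ASPECT_KEYWORDS = {
    "composition": {"composition", "framing", "crop", "horizon"},
    "lighting": {"light", "lighting", "exposure", "shadows"},
    "color": {"color", "colors", "saturation", "tones"},
    "focus": {"focus", "sharpness", "blur", "depth"},
}


def tokens(text):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in stripped if w]


def length_ratio(model_critique, human_critique):
    """Model length divided by human length, in tokens (human critique must be non-empty)."""
    return len(tokens(model_critique)) / len(tokens(human_critique))


def aspect_coverage(critique):
    """Fraction of lexicon aspects mentioned at least once."""
    words = set(tokens(critique))
    hits = [a for a, kws in ASPECT_KEYWORDS.items() if words & kws]
    return len(hits) / len(ASPECT_KEYWORDS)


def mean_pairwise_jaccard(critiques):
    """Average token-set overlap between critiques of the same photo (needs two or more)."""
    sets = [set(tokens(c)) for c in critiques]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


human = "Crop tighter; the horizon is tilted."
model = ("The composition is strong, the lighting is soft, "
         "the colors are rich, and the focus is sharp.")
print(length_ratio(model, human), aspect_coverage(human), aspect_coverage(model))
```

In this toy pair the human critique touches one aspect while the model critique touches all four, which is the pattern the bullets describe.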

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation of open-ended generation may need metrics that directly test whether specific observations change with the input image rather than measuring style match.
  • The consistent house style observed could limit usefulness in applications where critique must adapt to unique image details.
  • Similar mismatches between model output patterns and human selectivity may occur in other open-ended multimodal tasks such as captioning or visual reasoning.

Load-bearing premise

The ranked human references in the Reddit Photo Critique Dataset form the right target distribution for human-like aesthetic critique, and the wrong-image control successfully separates house style from image-specific content.

What would settle it

Models producing substantially different specific observations when given the correct photograph versus the wrong one would challenge the finding that their critiques lack image grounding.
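A direct check of this kind compares a model's two outputs with each other rather than with human references. As a minimal sketch, assuming a small hand-picked stopword list and a hypothetical `image_sensitivity` score, one could measure how many content words of the correct-image critique are absent from the wrong-image critique:

```python
# Small illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "it", "this", "but"}


def content_words(text):
    """Set of lowercased, punctuation-stripped tokens that are not stopwords."""
    stripped = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in stripped if w and w not in STOPWORDS}


def image_sensitivity(critique_correct, critique_wrong):
    """Share of content words in the correct-image critique missing from the wrong-image one.

    0.0 means the critique did not change with the image; 1.0 means no shared content words.
    """
    correct = content_words(critique_correct)
    wrong = content_words(critique_wrong)
    if not correct:
        return 0.0
    return len(correct - wrong) / len(correct)


print(image_sensitivity("The red barn is underexposed.",
                        "Nice leading lines in this street scene."))
```

Consistently high values across photos would be evidence of image-specific observation; values near zero would match the house-style account.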

Figures

Figures reproduced from arXiv: 2606.29689 by Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Maryam Amirizaniani, Ramtin Pedarsani, Sajjad Ghiasvand.

Figure 1: Overview of our evaluation pipeline. From the Reddit Photo Critique Dataset we retain …
Figure 2: Our primary CAPPED-PLAIN prompt: persona framing with a 50–150 word constraint and no aspect hint. The image is supplied alongside the text; {body} expands to the post body when present and is empty otherwise. … token budget max new tokens, set per condition to accommodate the requested output length. Under every condition each model produces three critiques per post, mirroring the three human reference…
Figure 3: Critique length distributions for two representative models under the capped (C …
Abstract

Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against numeric scores rather than the written critiques people actually give. We evaluate MLLM critiques against ranked human references and ask whether they are close to human ones. Using the Reddit Photo Critique Dataset, we score five open-weight MLLMs against multiple ranked human critiques per photo with reference-based similarity metrics, under six prompt conditions that disentangle persona framing, aspect hinting, length control, and single- versus multi-pass generation, and add an image-grounding control that feeds each model the wrong photograph. We find that reference-based similarity gives a misleading picture. Stricter lexical and learned metrics show only weak alignment with human critiques, while a coarse embedding cosine reports broad topical overlap that the grounding control traces to a stable house style rather than image-specific observation. Behaviorally, the models diverge from humans in consistent ways the scores do not surface: even under a length cap they write two to three times as much, cover nearly every aesthetic aspect where humans are selective, engage each aspect more uniformly and at greater depth, and repeat themselves across critiques of the same photo where humans vary. We argue that reference-based similarity rewards a fluent, comprehensive critique style rather than the selectivity and specificity of human critique, and discuss implications for evaluating and training open-ended multimodal generation.
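Scoring one generated critique against several ranked human references requires a per-reference metric and an aggregation rule. The sketch below uses a longest-common-subsequence F1, a simplified relative of ROUGE-L, and returns both the best score and a rank-weighted mean. The `1/rank` weighting and the function names are assumptions made for illustration; the abstract does not state how the paper aggregates over references.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_f1(candidate, reference):
    """F1 of LCS-based precision and recall over whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def score_against_references(candidate, ranked_references):
    """Return (best score, rank-weighted mean); references are ordered best-ranked first."""
    scores = [lcs_f1(candidate, ref) for ref in ranked_references]
    weights = [1 / rank for rank in range(1, len(scores) + 1)]  # assumed weighting
    weighted = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return max(scores), weighted


best, weighted = score_against_references("a b", ["a b", "c d"])
print(best, weighted)
```

Taking the maximum rewards matching any single human; the weighted mean rewards matching the highest-ranked ones. The two can disagree, which is one reason the aggregation choice should be reported.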

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper evaluates five open-weight MLLMs on open-ended aesthetic critique generation against ranked human references from the Reddit Photo Critique Dataset. Using six prompt conditions (disentangling persona, aspect hints, length, and generation passes) plus an image-grounding ablation with wrong photographs, it compares outputs via reference-based similarity metrics. The central claim is that these metrics mislead: stricter lexical/learned metrics show only weak human alignment, while coarse embedding cosine overlap traces to stable model house style rather than image-specific observation. Models also diverge behaviorally by producing longer, less selective, more uniform, and repetitive critiques.

Significance. If the empirical results and controls hold, the work is significant for multimodal evaluation. It shows that reference-based metrics can reward fluent comprehensive style over the selectivity and specificity characteristic of human aesthetic writing, with implications for training and benchmarking open-ended MLLM generation beyond numeric scores. The multi-condition prompt design and grounding ablation are positive features for isolating effects.

major comments (3)
  1. [Dataset section] The paper treats the ranked Reddit Photo Critique Dataset references as the appropriate target distribution for human aesthetic critique without explicit justification or comparison to other sources of aesthetic writing. This premise is load-bearing for the claim that models diverge from humans in selectivity and aspect coverage; if Reddit critiques differ systematically in those dimensions, the divergence interpretation does not follow.
  2. [Image-grounding control, abstract and methods] The wrong-photograph ablation attributes residual coarse cosine similarity to house style. However, the description provides no details on wrong-photo selection criteria (e.g., visual statistics, scene similarity, or prompt-induced generic language), leaving open whether the control cleanly isolates house style from image-specific reasoning. This assumption underpins the interpretation that reference-based metrics are misleading.
  3. [Results on metrics, likely §4 or evaluation subsection] The abstract states that stricter lexical and learned metrics show only weak alignment while coarse cosine reports broad overlap, but without naming the exact metrics, reporting their numerical values, or showing statistical tests in the provided description, the strength of the 'misleading picture' conclusion cannot be fully assessed.
minor comments (2)
  1. [Prompt conditions] The six prompt conditions are described at a high level in the abstract; a concise enumeration or table in the main text would improve reproducibility.
  2. [Behavioral analysis] Behavioral observations (length, aspect coverage, repetition) are summarized qualitatively; quantitative tables or figures with effect sizes would strengthen the divergence claim.
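Major comment 2 asks how the wrong photographs are chosen. One possible procedure, sketched here under explicit assumptions, filters candidates by scene label and by image-embedding similarity before sampling. The photo schema, the `pick_wrong_photo` name, and the `max_sim` threshold are placeholders; embeddings are assumed to be precomputed by some image encoder.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_wrong_photo(target_id, photos, max_sim=0.3, rng=None):
    """Pick a mismatched photo for the grounding control.

    photos: dict id -> {"embedding": list[float], "scene": str} (hypothetical schema).
    Returns the id of a randomly chosen photo with a different scene label and
    embedding cosine below max_sim, or None if no candidate qualifies.
    """
    rng = rng or random.Random(0)  # fixed seed so the pairing is reproducible
    target = photos[target_id]
    candidates = [
        pid for pid, p in photos.items()
        if pid != target_id
        and p["scene"] != target["scene"]
        and cosine(p["embedding"], target["embedding"]) < max_sim
    ]
    return rng.choice(sorted(candidates)) if candidates else None


photos = {
    "a": {"embedding": [1.0, 0.0], "scene": "landscape"},
    "b": {"embedding": [0.9, 0.1], "scene": "portrait"},   # too similar to "a"
    "c": {"embedding": [0.0, 1.0], "scene": "portrait"},   # dissimilar, different scene
    "d": {"embedding": [0.0, 1.0], "scene": "landscape"},  # same scene as "a"
}
print(pick_wrong_photo("a", photos))
```

Reporting the filter and the seed alongside the results would let readers judge whether residual similarity could come from visually related images.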

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important areas for clarification, and we address each major point below with targeted revisions where the manuscript can be strengthened. All responses are based on the submitted manuscript content.

Point-by-point responses
  1. Referee: [Dataset section] The paper treats the ranked Reddit Photo Critique Dataset references as the appropriate target distribution for human aesthetic critique without explicit justification or comparison to other sources of aesthetic writing. This premise is load-bearing for the claim that models diverge from humans in selectivity and aspect coverage; if Reddit critiques differ systematically in those dimensions, the divergence interpretation does not follow.

    Authors: We agree that the Dataset section would benefit from explicit justification. In the revised manuscript we have added a dedicated paragraph explaining the choice: the Reddit Photo Critique Dataset supplies ranked, open-ended, image-specific human critiques on consumer photographs, which directly matches the open-ended aesthetic critique task. We also include a brief comparison noting that, unlike professional art criticism or curated exhibition reviews, this source provides diverse non-expert perspectives with explicit ranking, making it suitable for measuring selectivity and aspect coverage against human distributions. revision: yes

  2. Referee: [Image-grounding control, abstract and methods] The wrong-photograph ablation attributes residual coarse cosine similarity to house style. However, the description provides no details on wrong-photo selection criteria (e.g., visual statistics, scene similarity, or prompt-induced generic language), leaving open whether the control cleanly isolates house style from image-specific reasoning. This assumption underpins the interpretation that reference-based metrics are misleading.

    Authors: We accept that additional detail on the control is warranted. The revised Methods section now specifies the selection procedure: wrong photographs were drawn from the same dataset but chosen to have low CLIP embedding cosine similarity (<0.3) and to belong to different high-level scene categories (e.g., landscape vs. portrait) to reduce both visual and prompt-induced generic overlap. This strengthens the isolation of house style from image-specific reasoning. revision: yes

  3. Referee: [Results on metrics, likely §4 or evaluation subsection] The abstract states that stricter lexical and learned metrics show only weak alignment while coarse cosine reports broad overlap, but without naming the exact metrics, reporting their numerical values, or showing statistical tests in the provided description, the strength of the 'misleading picture' conclusion cannot be fully assessed.

    Authors: The full manuscript already names the metrics in §4 (BLEU-4, ROUGE-L, METEOR, BERTScore, Sentence-BERT cosine, and CLIP cosine) and reports numerical values with standard errors in Table 2. We have now added explicit statistical comparisons (paired Wilcoxon tests against human inter-annotator baselines) with p-values in the same section and an appendix table to make the strength of the conclusion easier to assess without relying on the high-level abstract alone. revision: partial
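For readers unfamiliar with the paired Wilcoxon test mentioned in the third response, the sketch below shows the computation on paired per-photo scores. It is a teaching sketch, not the authors' analysis code: it uses a normal approximation without tie or continuity correction, which is rough for small samples. SciPy's `scipy.stats.wilcoxon` offers a maintained implementation.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test with a normal approximation.

    Returns (w_plus, w_minus, two_sided_p). Zero differences are dropped and
    tied absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return w_plus, w_minus, p


# Toy paired scores: e.g. human-vs-human similarity and model-vs-human similarity per photo.
w_plus, w_minus, p = wilcoxon_signed_rank([2, 4, 6, 8, 10], [1, 2, 3, 4, 5])
print(w_plus, w_minus, round(p, 4))
```

The pairing matters: each photo contributes one difference, so photo-level difficulty cancels out of the comparison.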

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation against external human data

Full rationale

The paper performs an empirical comparison of MLLM-generated critiques against ranked human references from the Reddit Photo Critique Dataset, using reference-based similarity metrics (lexical, learned, and embedding cosine) under controlled prompt variations and an image-grounding ablation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps; the central claims rest on direct measurement against independent human data and off-the-shelf metrics rather than any self-definitional or self-referential reduction. The assumptions about the dataset and control are empirical premises open to external challenge, not circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical evaluation study. It contains no mathematical derivations, fitted parameters, or newly postulated entities. The sole domain assumption is that the chosen human reference set defines the desired target behavior.

axioms (1)
  • Domain assumption: The ranked human critiques in the Reddit Photo Critique Dataset represent the appropriate target for human-like aesthetic reasoning.
    All reference-based scoring and the claim of divergence rest on treating these human texts as the gold standard.

pith-pipeline@v0.9.1-grok · 5822 in / 1386 out tokens · 40262 ms · 2026-06-30T06:44:12.899517+00:00 · methodology

