pith. machine review for the scientific record. sign in

arxiv: 2604.03765 · v2 · submitted 2026-04-04 · 💻 cs.CV

Recognition: no theorem link

ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-13 17:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords image captioningmultimodal large language modelsevaluation metricreconstruction consistencyhuman judgment alignmentbenchmark datasetITIScoreICBench
0
0 comments X

The pith

ITIScore rates MLLM image captions by measuring how consistently they allow reconstruction of the original image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ICBench, a dataset of 40,000 captions from 10 recent MLLMs on 2,000 images across 12 categories, with human mean opinion scores on fluency, relevance, conciseness for short captions and completeness for long ones. It introduces ITIScore, an automatic metric that scores a caption by feeding it back into an image generator and checking consistency with the source image. Existing benchmarks are limited by short caption lengths, older models, and sparse human labels, so this setup supplies more diverse data and a scalable scorer. A sympathetic reader would care because the metric matches human ratings closely and works zero-shot on other caption datasets, reducing reliance on repeated human studies. Experiments confirm the alignment and generalization.

Core claim

The authors establish that their ITIScore metric, which operates by generating an image from the caption and measuring reconstruction consistency with the original image, provides an automatic rating of caption quality that aligns strongly with human mean opinion scores on fine-grained dimensions and generalizes robustly to other public captioning datasets.

What carries the argument

ITIScore, the image-to-text-to-image framework that quantifies caption quality through reconstruction consistency between the original image and the image regenerated from the caption text.

If this is right

  • The metric enables scalable evaluation of captioning performance across many more MLLMs and images without proportional increases in human annotation effort.
  • Separate scoring for short and long captions allows targeted assessment of conciseness versus completeness in addition to fluency and relevance.
  • Zero-shot application to existing public datasets provides immediate comparability without retraining the metric.
  • MLLM developers can track progress in caption generation using a consistent automatic signal that tracks human preferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fixing the reconstruction model across successive MLLM versions would create an objective track record of captioning improvement over time.
  • The reconstruction-consistency idea could extend to automatic evaluation of other multimodal outputs such as visual question answering responses.
  • Averaging reconstruction results across several different image generators might reduce any bias introduced by a single reconstruction model.

Load-bearing premise

That the success of reconstructing the original image from the caption text accurately reflects the fine-grained human-rated qualities of fluency, relevance, conciseness, and completeness without introducing biases from the reconstruction model.

What would settle it

Collect new human ratings and compute ITIScore on captions from an additional MLLM not used in the original experiments; if the correlation between the automatic scores and human judgments falls substantially below the reported alignment, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2604.03765 by Guangji Ma, Guangtao Zhai, Guangyu Yang, Huiyu Duan, Ke Gu, Patrick Le Callet, Shengyao Qin, Xiongkuo Min, Zitong Xu.

Figure 1
Figure 1. Figure 1: Overview of our ICBench. (a) We first collect 2,040 source images across 12 fine-grained tasks. Then 10 advanced [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: MOS distribution of short caption and long caption across different evaluation dimensions. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison of different MLLMs on short captioning in terms of fluency, relevance, and conciseness, and [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of our ITIScore. Given an image and its caption, a pretrained generative model reconstructs an image from [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
read the original abstract

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed, ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, \textbf{ITIScore}, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ICBench, a benchmark with 2K images across 12 categories and 40K captions (short and long) generated by 10 advanced MLLMs, accompanied by human MOS ratings on fluency/relevance/conciseness for short captions and fluency/relevance/completeness for long captions. It proposes ITIScore, an automated metric that scores caption quality via reconstruction consistency between the original image and the image generated from the caption by a fixed text-to-image model. The authors claim this metric exhibits strong alignment with human judgments and robust zero-shot generalization to other public captioning datasets.

Significance. If the alignment claim holds after proper validation, ITIScore would supply a scalable, low-cost automated proxy for human evaluation of MLLM captioning, addressing the diversity and annotation limitations of prior benchmarks and enabling faster iteration on new models.

major comments (2)
  1. [Abstract] Abstract: the claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.
  2. [ITIScore framework] ITIScore definition (image-to-text-to-image framework): the reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.
minor comments (1)
  1. [Abstract] Abstract: the total of 40K captions from 10 MLLMs on 2K images implies 20 captions per image; clarify whether this count includes both short and long variants per image or how the split is performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make the necessary revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript reports Pearson and Spearman correlations (with p-values and sample sizes) in the experimental results section. We will revise the abstract to incorporate these key metrics directly, e.g., updating the final sentence to reference the observed correlations on ICBench and the zero-shot datasets. revision: yes

  2. Referee: [ITIScore framework] ITIScore definition (image-to-text-to-image framework): the reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.

    Authors: This is a fair and important point regarding potential bias from the fixed reconstructor. While the main experiments use a single Stable Diffusion model with CLIP similarity, we did not include the requested ablations or independence analysis in the initial submission. We will add a dedicated subsection with ablations across multiple T2I models, alternative similarity metrics (CLIP, LPIPS, and pixel-level), and correlation tests between reconstructor outputs and human dimension scores to demonstrate that the metric primarily captures caption fidelity. revision: yes

Circularity Check

0 steps flagged

ITIScore is defined directly as reconstruction consistency; no derivation reduces to inputs by construction

full rationale

The paper explicitly defines ITIScore as the similarity between an original image and the image reconstructed from its caption via a fixed text-to-image model. This is a definitional proposal, not a derived result. Alignment with human MOS is presented as an empirical finding on the ICBench dataset, with no equations or steps that equate the metric to its own inputs or to self-cited priors. No fitted parameters are renamed as predictions, no uniqueness theorems are imported, and no ansatz is smuggled via citation. The framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that image reconstruction consistency is a valid proxy for human caption quality judgments; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Reconstruction consistency from caption to image back to caption measures caption quality
    Invoked in the definition of ITIScore as the core of the automated metric.

pith-pipeline@v0.9.0 · 5567 in / 1218 out tokens · 41557 ms · 2026-05-13T17:30:33.102124+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 9 internal anchors

  1. [1]

    Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloi- monos. 2015. From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge.arXiv preprint arXiv:1511.03292 (2015)

  2. [2]

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. InProceedings of the European Conference on Computer Vision (ECCV). 382–398

  3. [3]

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, et al

  5. [5]

    Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923(2025)

  6. [6]

    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (ACL Workshop)

  7. [7]

    David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny

  8. [8]

    InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    Clair: Evaluating image captions with large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 13638–13646

  9. [9]

    Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, et al

  10. [10]

    InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

    Caparena: Benchmarking and analyzing detailed image captioning in the llm era. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 14077–14094

  11. [11]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.arXiv preprint arXiv:2507.06261(2025)

  12. [12]

    Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 10578–10587

  13. [13]

    Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, et al. 2025. Finevq: Fine-grained user generated content video quality assessment. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR). 3206–3217

  14. [14]

    Huiyu Duan, Xiongkuo Min, Yucheng Zhu, Guangtao Zhai, Xiaokang Yang, and Patrick Le Callet. 2022. Confusing image quality assessment: Toward better augmented reality experience.IEEE Transactions on Image Processing (TIP)31 (2022), 7206–7221

  15. [15]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, et al. 2024. Scaling rectified flow transformers for high-resolution im- age synthesis. InProceedings of the International Conference on Machine Learning (ICML)

  16. [16]

    Google. 2025. Gemini 3.0. https://deepmind.google/technologies/gemini/

  17. [17]

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi

  18. [18]

    In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

    CLIPScore: A Reference-Free Evaluation Metric for Image Captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 7504–7513

  19. [19]

    Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics.Journal of Artificial Intelligence Research (JAIR)47 (2013), 853–899

  20. [20]

    2012.Methodology for the Subjec- tive Assessment of the Quality of Television Pictures

    International Telecommunication Union (ITU). 2012.Methodology for the Subjec- tive Assessment of the Quality of Television Pictures. Technical Report Rec. ITU-R BT.500-13. International Telecommunication Union (ITU)

  21. [21]

    Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, and Noah A Smith. 2022. Transparent human evaluation for image captioning. InProceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 3464–3478

  22. [22]

    Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, and Sang-Woo Lee. 2022. Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. InAdvances in Neural Information Processing Systems (NeurIPS). 35072– 35086

  23. [23]

    Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. 2021. UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 220–226

  24. [24]

    Yebin Lee, Imseong Park, and Myungjoo Kang. 2024. Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model. InProceedings of the Association for Computational Linguistics (ACL). 3732–3746

  25. [25]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Boot- strapping Language-Image Pre-training for Unified Vision-Language Understand- ing and Generation. InProceedings of the International Conference on Machine Learning (ICML)

  26. [26]

    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out: Proceedings of the ACL Workshop

  27. [27]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, et al

  28. [28]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.arXiv preprint arXiv:2405.04434(2024)

  29. [29]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 26296–26306

  30. [30]

    Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, et al. 2025. MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration.IEEE Journal of Selected Topics in Signal Processing(2025)

  31. [31]

    Xin Liu, Yang Li, Zhe Wang, et al . 2024. VELA: Evaluating Long-form Image Captioning with Automated Metrics and Human Feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  32. [32]

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, et al. 2025. Ovis2.5 Technical Report.arXiv:2508.11737(2025)

  33. [33]

    Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. 2024. Vi- sion language model-based caption evaluation method leveraging visual context extraction.arXiv preprint arXiv:2402.17969(2024)

  34. [34]

    AI Meta. 2024. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Cus- tomizable Models.Meta AI Blog. Retrieved December(2024)

  35. [35]

    Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, and Nakamasa Inoue. 2024. Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Lan- guage Models.arXiv preprint arXiv:2412.14613(2024)

  36. [36]

    OpenAI. 2025. ChatGPT-4o: Advanced Multimodal Chat Model. https://openai. com/chatgpt

  37. [37]

    OpenAI. 2025. GPT-5. https://www.openai.com

  38. [38]

    Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 10971–10980

  39. [39]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

  40. [40]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. InProceedings of the International Conference on Machine Learning (ICML)

  41. [41]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695

  42. [42]

    Noam Rotstein, David Bensaid, Shaked Brody, Roy Ganz, and Ron Kimmel. 2024. Fusecap: Leveraging large language models for enriched fused image captions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (CVPR). 5689–5700

  43. [43]

    Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cuc- chiara. 2023. Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. InProceedings of the IEEE/CVF Conference on Computer MM ’26, November 10–14 2026, Rio de Janeiro, Brazil Vision and Pattern Recognition (CVPR). 6914–6924

  44. [44]

    Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. BRIDGE: Bridging gaps in image captioning evaluation with stronger visual cues. InPro- ceedings of the European Conference on Computer Vision (ECCV). 70–87

  45. [45]

    Sara Sarto, Marcella Cornia, and Rita Cucchiara. 2025. Image captioning evalua- tion in the age of multimodal llms: Challenges and future perspectives.arXiv preprint arXiv:2503.14604(2025)

  46. [46]

    Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Yeung. 2025. G-veval: A versatile metric for evaluating image and video captions using gpt-4o. In Proceedings of the Conference on Association for the Advancement of Artificial Intelligence (AAAI), Vol. 39. 7419–7427

  47. [47]

    Lawrence Zitnick, and Devi Parikh

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based Image Description Evaluation. InProceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR). 4566–4575

  48. [48]

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  49. [49]

    Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sugiura. 2024. Polos: Multi- modal metric learning from human feedback for image captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13559–13568

  50. [50]

    Haoran Wang, Yue Zhang, and Xiaosheng Yu. 2020. An overview of image caption generation methods.Computational intelligence and neuroscience (CIN)2020, 1 (2020), 3062706

  51. [51]

    Jiarui Wang, Huiyu Duan, Guangtao Zhai, and Xiongkuo Min. 2026. Quality assessment for AI generated images with instruction tuning.IEEE Transactions on Multimedia (TMM)(2026)

  52. [52]

    Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, and Xiongkuo Min. 2025. Lmm4lmm: Benchmarking and evaluating large-multimodal image generation with lmms. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 17312–17323

  53. [53]

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. 2022. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100(2022)

  54. [54]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, et al. 2025. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency.arXiv preprint arXiv:2508.18265(2025)

  55. [55]

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, et al

  56. [56]

    Qwen-Image Technical Report.arXiv preprint arXiv:2508.02324(2025)

  57. [57]

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. InProceedings of the International Conference on Machine Learning (ICML)

  58. [58]

    Liming Xu, Quan Tang, Jiancheng Lv, Bochuan Zheng, Xianhua Zeng, and Weisheng Li. 2023. Deep image captioning: A review of methods, trends and future challenges.Neurocomputing546 (2023), 126287

  59. [59]

    Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, et al. 2026. EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing.arXiv preprint arXiv:2603.14916(2026)

  60. [60]

    Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, et al

  61. [61]

    InProceedings of the ACM International Conference on Multimedia (ACM MM)

    LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs. InProceedings of the ACM International Conference on Multimedia (ACM MM). 6908–6917

  62. [62]

    Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, et al

  63. [63]

    InIEEE International Conference on Multimedia and Expo (ICME)

    Harmonyiqa: Pioneering benchmark and model for image harmonization quality assessment. InIEEE International Conference on Multimedia and Expo (ICME). 1–6

  64. [64]

    Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang, Qiang Hu, Jing Liu, Xiongkuo Min, and Guangtao Zhai. 2026. ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation. arXiv preprint arXiv:2511.14259(2026)

  65. [65]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. 2025. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388(2025)

  66. [66]

    Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, et al. 2025. ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments? arXiv preprint arXiv:2510.11549(2025)

  67. [67]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, et al

  68. [68]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone.arXiv preprint arXiv:2408.01800(2024)

  69. [69]

    Ziwei Yao, Ruiping Wang, and Xilin Chen. 2024. Hifi-score: Fine-grained image description evaluation with hierarchical parsing graphs. InProceedings of the European Conference on Computer Vision (ECCV). 441–458

  70. [70]

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, et al. 2024. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multimodal Large Language Models. InProceedings of the International Conference on Learning Representations (ICLR)

  71. [71]

    Weinberger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi

  72. [72]

    BERTScore: Evaluating Text Generation with BERT.arXiv preprint arXiv:1904.09675(2020)

  73. [73]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, et al. 2025. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.arXiv preprint arXiv:2504.10479(2025)