Recognition: no theorem link
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
Pith reviewed 2026-05-13 17:30 UTC · model grok-4.3
The pith
ITIScore rates MLLM image captions by measuring how consistently they allow reconstruction of the original image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that their ITIScore metric, which operates by generating an image from the caption and measuring reconstruction consistency with the original image, provides an automatic rating of caption quality that aligns strongly with human mean opinion scores on fine-grained dimensions and generalizes robustly to other public captioning datasets.
What carries the argument
ITIScore, the image-to-text-to-image framework that quantifies caption quality through reconstruction consistency between the original image and the image regenerated from the caption text.
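The review describes the framework only at a high level. As a hedged sketch of what the scoring step could look like, the following uses stand-in callables — `mllm`, `t2i`, and `encoder` are hypothetical placeholders for the captioning model, the fixed text-to-image model, and a visual encoder such as CLIP; the cosine-similarity choice is an assumption, not the paper's confirmed metric:

```python
import numpy as np

def itiscore(original_emb: np.ndarray, reconstructed_emb: np.ndarray) -> float:
    """Cosine similarity between the embedding of the original image and the
    embedding of the image regenerated from its caption (assumed scoring rule)."""
    a = original_emb / np.linalg.norm(original_emb)
    b = reconstructed_emb / np.linalg.norm(reconstructed_emb)
    return float(a @ b)

def rate_caption(image, mllm, t2i, encoder) -> float:
    """End-to-end image-to-text-to-image sketch with hypothetical components."""
    caption = mllm(image)            # image -> text
    reconstruction = t2i(caption)    # text -> image
    return itiscore(encoder(image), encoder(reconstruction))
```

Identical embeddings score 1.0 and orthogonal embeddings score 0, so a faithful caption should yield a reconstruction whose embedding sits close to the original's.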
If this is right
- The metric enables scalable evaluation of captioning performance across many more MLLMs and images without proportional increases in human annotation effort.
- Separate scoring for short and long captions allows targeted assessment of conciseness versus completeness in addition to fluency and relevance.
- Zero-shot application to existing public datasets provides immediate comparability without retraining the metric.
- MLLM developers can monitor captioning progress using a consistent automatic signal that tracks human preferences.
Where Pith is reading between the lines
- Fixing the reconstruction model across successive MLLM versions would create an objective track record of captioning improvement over time.
- The reconstruction-consistency idea could extend to automatic evaluation of other multimodal outputs such as visual question answering responses.
- Averaging reconstruction results across several different image generators might reduce any bias introduced by a single reconstruction model.
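The generator-averaging idea above is simple to state concretely. This is a minimal sketch, not the paper's method; the generator names and score values are illustrative:

```python
import statistics

def ensemble_itiscore(per_generator_scores: dict[str, float]) -> float:
    """Average reconstruction-consistency scores from several text-to-image
    models so that no single generator's prior dominates the rating."""
    return statistics.fmean(per_generator_scores.values())

# Hypothetical per-generator scores for one caption.
scores = {"sd3": 0.81, "flux": 0.77, "qwen-image": 0.79}
combined = ensemble_itiscore(scores)
```

A trimmed mean or median would be a natural variant if one generator is suspected of being an outlier.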
Load-bearing premise
That the success of reconstructing the original image from the caption text accurately reflects the fine-grained human-rated qualities of fluency, relevance, conciseness, and completeness without introducing biases from the reconstruction model.
What would settle it
Collect new human ratings and compute ITIScore on captions from an additional MLLM not used in the original experiments; if the correlation between the automatic scores and human judgments falls substantially below the reported alignment, the central claim is falsified.
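The proposed falsification test reduces to a rank-correlation check between automatic scores and fresh human judgments. A self-contained Spearman implementation (simplified: it assumes no tied values) with hypothetical held-out data:

```python
def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation; simplified version assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical held-out check: ITIScore vs. new human MOS on the same captions.
itiscores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_mos = [3.1, 3.8, 2.9, 4.4, 3.5]
rho = spearman(itiscores, human_mos)
```

If the correlation on the held-out MLLM falls far below the correlations reported on ICBench, the alignment claim fails the test.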
Original abstract
Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ICBench, a benchmark with 2K images across 12 categories and 40K captions (short and long) generated by 10 advanced MLLMs, accompanied by human MOS ratings on fluency/relevance/conciseness for short captions and fluency/relevance/completeness for long captions. It proposes ITIScore, an automated metric that scores caption quality via reconstruction consistency between the original image and the image generated from the caption by a fixed text-to-image model. The authors claim this metric exhibits strong alignment with human judgments and robust zero-shot generalization to other public captioning datasets.
Significance. If the alignment claim holds after proper validation, ITIScore would supply a scalable, low-cost automated proxy for human evaluation of MLLM captioning, addressing the diversity and annotation limitations of prior benchmarks and enabling faster iteration on new models.
major comments (2)
- [Abstract] The claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.
- [ITIScore framework] ITIScore definition (image-to-text-to-image framework): the reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.
minor comments (1)
- [Abstract] The total of 40K captions from 10 MLLMs on 2K images implies 20 captions per image; clarify whether this count includes both short and long variants per image or how the split is performed.
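The count in the minor comment is easy to make explicit; the factor-of-two split into short and long variants is the reading the abstract suggests, not a confirmed detail:

```python
# 2,000 images x 10 MLLMs x 2 caption lengths (short + long, assumed split)
images, mllms, lengths = 2_000, 10, 2
captions_total = images * mllms * lengths      # 40,000 captions in total
captions_per_image = captions_total // images  # 20 captions per image
```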
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make the necessary revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.
  Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript reports Pearson and Spearman correlations (with p-values and sample sizes) in the experimental results section. We will revise the abstract to incorporate these key metrics directly, e.g., updating the final sentence to reference the observed correlations on ICBench and the zero-shot datasets. Revision: yes.
- Referee: [ITIScore framework] The reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.
  Authors: This is a fair and important point regarding potential bias from the fixed reconstructor. While the main experiments use a single Stable Diffusion model with CLIP similarity, we did not include the requested ablations or independence analysis in the initial submission. We will add a dedicated subsection with ablations across multiple T2I models, alternative similarity metrics (CLIP, LPIPS, and pixel-level), and correlation tests between reconstructor outputs and human dimension scores to demonstrate that the metric primarily captures caption fidelity. Revision: yes.
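The promised ablation amounts to scoring a grid of (reconstructor, similarity metric) configurations by their correlation with human ratings. A minimal sketch under that assumption — all configuration names and score values below are hypothetical:

```python
import numpy as np

def pearson(a, b) -> float:
    """Pearson correlation via the 2x2 correlation matrix."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical ablation grid: per-configuration ITIScores for the same
# captions, each correlated against human MOS on the same items.
human_mos = np.array([3.1, 3.8, 2.9, 4.4, 3.5])
configs = {
    ("sd3", "clip"):  np.array([0.62, 0.71, 0.55, 0.80, 0.67]),
    ("sd3", "lpips"): np.array([0.58, 0.69, 0.60, 0.78, 0.66]),
    ("flux", "clip"): np.array([0.60, 0.73, 0.54, 0.82, 0.65]),
}
ranking = sorted(configs, key=lambda c: pearson(configs[c], human_mos),
                 reverse=True)
best = ranking[0]  # configuration whose scores best track human judgments
```

If the ranking is stable across reconstructors, the metric is unlikely to be dominated by any single generator's artifacts.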
Circularity Check
ITIScore is defined directly as reconstruction consistency; it is a definitional proposal, so by construction no derivation collapses the metric into its own inputs.
Full rationale
The paper explicitly defines ITIScore as the similarity between an original image and the image reconstructed from its caption via a fixed text-to-image model. This is a definitional proposal, not a derived result. Alignment with human MOS is presented as an empirical finding on the ICBench dataset, with no equations or steps that equate the metric to its own inputs or to self-cited priors. No fitted parameters are renamed as predictions, no uniqueness theorems are imported, and no ansatz is smuggled via citation. The framework is validated against external benchmarks rather than against itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: reconstruction consistency between an original image and the image regenerated from its caption measures caption quality