Recognition: no theorem link
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
Pith reviewed 2026-05-13 17:30 UTC · model grok-4.3
The pith
ITIScore rates MLLM image captions by measuring how consistently they allow reconstruction of the original image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that their ITIScore metric, which operates by generating an image from the caption and measuring reconstruction consistency with the original image, provides an automatic rating of caption quality that aligns strongly with human mean opinion scores on fine-grained dimensions and generalizes robustly to other public captioning datasets.
What carries the argument
ITIScore, the image-to-text-to-image framework that quantifies caption quality through reconstruction consistency between the original image and the image regenerated from the caption text.
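The review describes the framework only at a high level. As a hedged sketch of what the scoring step could look like, the following uses stand-in callables — `mllm`, `t2i`, and `encoder` are hypothetical placeholders for the captioning model, the fixed text-to-image model, and a visual encoder such as CLIP; the cosine-similarity choice is an assumption, not the paper's confirmed metric:

```python
import numpy as np

def itiscore(original_emb: np.ndarray, reconstructed_emb: np.ndarray) -> float:
    """Cosine similarity between the embedding of the original image and the
    embedding of the image regenerated from its caption (assumed scoring rule)."""
    a = original_emb / np.linalg.norm(original_emb)
    b = reconstructed_emb / np.linalg.norm(reconstructed_emb)
    return float(a @ b)

def rate_caption(image, mllm, t2i, encoder) -> float:
    """End-to-end image-to-text-to-image sketch with hypothetical components."""
    caption = mllm(image)            # image -> text
    reconstruction = t2i(caption)    # text -> image
    return itiscore(encoder(image), encoder(reconstruction))
```

Identical embeddings score 1.0 and orthogonal embeddings score 0, so a faithful caption should yield a reconstruction whose embedding sits close to the original's.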
If this is right
- The metric enables scalable evaluation of captioning performance across many more MLLMs and images without proportional increases in human annotation effort.
- Separate scoring for short and long captions allows targeted assessment of conciseness versus completeness in addition to fluency and relevance.
- Zero-shot application to existing public datasets provides immediate comparability without retraining the metric.
- MLLM developers can monitor captioning progress using a consistent automatic signal that tracks human preferences.
Where Pith is reading between the lines
- Fixing the reconstruction model across successive MLLM versions would create an objective track record of captioning improvement over time.
- The reconstruction-consistency idea could extend to automatic evaluation of other multimodal outputs such as visual question answering responses.
- Averaging reconstruction results across several different image generators might reduce any bias introduced by a single reconstruction model.
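The generator-averaging idea above is simple to state concretely. This is a minimal sketch, not the paper's method; the generator names and score values are illustrative:

```python
import statistics

def ensemble_itiscore(per_generator_scores: dict[str, float]) -> float:
    """Average reconstruction-consistency scores from several text-to-image
    models so that no single generator's prior dominates the rating."""
    return statistics.fmean(per_generator_scores.values())

# Hypothetical per-generator scores for one caption.
scores = {"sd3": 0.81, "flux": 0.77, "qwen-image": 0.79}
combined = ensemble_itiscore(scores)
```

A trimmed mean or median would be a natural variant if one generator is suspected of being an outlier.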
Load-bearing premise
That the success of reconstructing the original image from the caption text accurately reflects the fine-grained human-rated qualities of fluency, relevance, conciseness, and completeness without introducing biases from the reconstruction model.
What would settle it
Collect new human ratings and compute ITIScore on captions from an additional MLLM not used in the original experiments; if the correlation between the automatic scores and human judgments falls substantially below the reported alignment, the central claim is falsified.
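The proposed falsification test reduces to a rank-correlation check between automatic scores and fresh human judgments. A self-contained Spearman implementation (simplified: it assumes no tied values) with hypothetical held-out data:

```python
def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation; simplified version assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical held-out check: ITIScore vs. new human MOS on the same captions.
itiscores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_mos = [3.1, 3.8, 2.9, 4.4, 3.5]
rho = spearman(itiscores, human_mos)
```

If the correlation on the held-out MLLM falls far below the correlations reported on ICBench, the alignment claim fails the test.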
Original abstract
Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ICBench, a benchmark with 2K images across 12 categories and 40K captions (short and long) generated by 10 advanced MLLMs, accompanied by human MOS ratings on fluency/relevance/conciseness for short captions and fluency/relevance/completeness for long captions. It proposes ITIScore, an automated metric that scores caption quality via reconstruction consistency between the original image and the image generated from the caption by a fixed text-to-image model. The authors claim this metric exhibits strong alignment with human judgments and robust zero-shot generalization to other public captioning datasets.
Significance. If the alignment claim holds after proper validation, ITIScore would supply a scalable, low-cost automated proxy for human evaluation of MLLM captioning, addressing the diversity and annotation limitations of prior benchmarks and enabling faster iteration on new models.
major comments (2)
- [Abstract] The claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.
- [ITIScore framework] ITIScore definition (image-to-text-to-image framework): the reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.
minor comments (1)
- [Abstract] The total of 40K captions from 10 MLLMs on 2K images implies 20 captions per image; clarify whether this count includes both short and long variants per image or how the split is performed.
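The count in the minor comment is easy to make explicit; the factor-of-two split into short and long variants is the reading the abstract suggests, not a confirmed detail:

```python
# 2,000 images x 10 MLLMs x 2 caption lengths (short + long, assumed split)
images, mllms, lengths = 2_000, 10, 2
captions_total = images * mllms * lengths      # 40,000 captions in total
captions_per_image = captions_total // images  # 20 captions per image
```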
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make the necessary revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim of 'strong alignment between our automatic metric and human judgments' is presented without any quantitative support (correlation coefficients, p-values, error bars, or sample sizes), leaving the central empirical result without visible derivation or validation steps.
  Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript reports Pearson and Spearman correlations (with p-values and sample sizes) in the experimental results section. We will revise the abstract to incorporate these key metrics directly, e.g., updating the final sentence to reference the observed correlations on ICBench and the zero-shot datasets. Revision: yes.
- Referee: [ITIScore framework] The reconstruction consistency score is only valid if the fixed text-to-image model's generative prior is neutral with respect to the human-rated dimensions; no ablation on reconstructor choice, no comparison of similarity metrics (CLIP, LPIPS, pixel-level), and no independence test are described, so the metric may partly reflect reconstructor artifacts rather than caption fidelity.
  Authors: This is a fair and important point regarding potential bias from the fixed reconstructor. While the main experiments use a single Stable Diffusion model with CLIP similarity, we did not include the requested ablations or independence analysis in the initial submission. We will add a dedicated subsection with ablations across multiple T2I models, alternative similarity metrics (CLIP, LPIPS, and pixel-level), and correlation tests between reconstructor outputs and human dimension scores to demonstrate that the metric primarily captures caption fidelity. Revision: yes.
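The promised ablation amounts to scoring a grid of (reconstructor, similarity metric) configurations by their correlation with human ratings. A minimal sketch under that assumption — all configuration names and score values below are hypothetical:

```python
import numpy as np

def pearson(a, b) -> float:
    """Pearson correlation via the 2x2 correlation matrix."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical ablation grid: per-configuration ITIScores for the same
# captions, each correlated against human MOS on the same items.
human_mos = np.array([3.1, 3.8, 2.9, 4.4, 3.5])
configs = {
    ("sd3", "clip"):  np.array([0.62, 0.71, 0.55, 0.80, 0.67]),
    ("sd3", "lpips"): np.array([0.58, 0.69, 0.60, 0.78, 0.66]),
    ("flux", "clip"): np.array([0.60, 0.73, 0.54, 0.82, 0.65]),
}
ranking = sorted(configs, key=lambda c: pearson(configs[c], human_mos),
                 reverse=True)
best = ranking[0]  # configuration whose scores best track human judgments
```

If the ranking is stable across reconstructors, the metric is unlikely to be dominated by any single generator's artifacts.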
Circularity Check
ITIScore is defined directly as reconstruction consistency; it is a definitional proposal, so by construction no derivation collapses the metric into its own inputs.
Full rationale
The paper explicitly defines ITIScore as the similarity between an original image and the image reconstructed from its caption via a fixed text-to-image model. This is a definitional proposal, not a derived result. Alignment with human MOS is presented as an empirical finding on the ICBench dataset, with no equations or steps that equate the metric to its own inputs or to self-cited priors. No fitted parameters are renamed as predictions, no uniqueness theorems are imported, and no ansatz is smuggled via citation. The framework is validated against external benchmarks rather than against itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: reconstruction consistency between an original image and the image regenerated from its caption measures caption quality