VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3
The pith
VC-Inspector evaluates video captions for factual accuracy without reference texts by learning from synthetic errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VC-Inspector is a large multimodal model trained to perform reference-free evaluation of video captions with a focus on factual accuracy. Using a systematic method to generate captions containing controllable factual errors along with graded quality scores and explanatory annotations, the model learns to deliver assessments that show state-of-the-art correlation with human judgments on benchmarks including VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF.
What carries the argument
The systematic framework for generating video captions with controllable factual errors, paired with graded quality scores and explanatory annotations, which enables supervised training of the evaluation model.
If this is right
- Delivers reproducible evaluations that do not depend on proprietary language models or services.
- Identifies specific factual issues in captions to guide improvements in video description models.
- Maintains strong performance when applied to caption datasets from varied sources and domains.
- Supplies explanatory annotations alongside scores for greater interpretability.
Where Pith is reading between the lines
- Similar synthetic error injection could improve evaluation for other generative tasks such as text summarization.
- Using the model as a critic during training of caption generators might reduce factual hallucinations in produced video descriptions.
- Extending the approach to longer video sequences could test its limits in handling extended context.
Load-bearing premise
The synthetic captions with injected factual errors closely resemble the factual mistakes produced by real video captioning models.
What would settle it
Collecting human judgments on factual accuracy for captions generated by current video models and measuring if VC-Inspector's correlation with those judgments falls substantially below the reported levels on the existing benchmarks.
Figures
read the original abstract
We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. Project page is available at https://dipta007.github.io/VC-Inspector
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VC-Inspector, a lightweight open-source large multimodal model (LMM) for reference-free evaluation of video captions with emphasis on factual accuracy. It introduces a systematic framework for generating synthetic captions containing controllable factual errors, accompanied by graded quality scores and explanatory annotations. The model is trained on this data and evaluated on multiple benchmarks, claiming state-of-the-art correlation with human judgments along with generalization across domains such as VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, while also indicating utility for caption improvement.
Significance. If the central claims are substantiated, this work would supply a reproducible, open-source alternative to reference-dependent or proprietary metrics for assessing factual correctness in video captions. The synthetic generation framework with graded annotations and explanations represents a constructive contribution for interpretable training of factuality evaluators. These elements could support more reliable video understanding pipelines without reliance on external services.
major comments (2)
- [§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.
- [§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.
minor comments (2)
- [Abstract] Abstract: The claim of 'state-of-the-art correlation' is stated without any numerical values; including the key Pearson or Spearman coefficients would strengthen the summary.
- [Throughout] Notation and terminology: Acronyms such as LMM, VATEX-Eval, and Flickr8K-CF should be defined on first use in the main text for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to strengthen the experimental reporting and validation of the synthetic framework.
read point-by-point responses
-
Referee: [§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.
Authors: We thank the referee for this observation. While correlation results appear in tables in the original §4, we agree that explicit numerical values, ablations, and error bars were not presented with sufficient prominence or detail in the text. In the revised manuscript we have expanded §4 with a new table that reports exact Pearson and Spearman correlations for VC-Inspector and all baselines on VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF. We have also added ablation studies on the contribution of each factual-error category and on model components, together with standard deviations computed over five independent training runs to supply error-bar information. These additions directly support verification of the reported superiority and robustness. revision: yes
-
Referee: [§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.
Authors: We acknowledge the validity of this concern. The synthetic framework was constructed from error categories repeatedly documented in the video-captioning literature, yet we did not originally include a side-by-side distributional comparison with errors produced by deployed models. To address the point, the revised §3 now contains a new validation subsection. We generated captions with several open-source video captioners on a held-out video set, had the factual errors annotated by the same protocol used for the synthetic data, and report the resulting type distributions alongside those of the synthetic captions. The comparison shows substantial overlap while also noting residual differences, which we discuss as a limitation and direction for future refinement. revision: yes
Circularity Check
No significant circularity; evaluation is self-contained against external benchmarks
full rationale
The paper introduces VC-Inspector as a new LMM trained on a custom synthetic caption dataset with injected factual errors, then measures its performance via correlation with human judgments on independent external benchmarks (VATEX-Eval, Flickr8K-Expert, Flickr8K-CF). No equations, derivations, or fitted parameters are shown that reduce the reported SOTA correlations to quantities defined inside the paper itself. The central empirical claim rests on training followed by testing against held-out human labels rather than any self-referential construction, self-citation chain, or ansatz smuggled via prior work. This matches the default case of a self-contained model trained and evaluated on external data.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Human judgments are the appropriate gold standard for factual correctness of video captions.
- domain assumption Captions generated by deliberately injecting controllable factual errors are representative of errors made by real captioning models.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
score=1−# of changed objects & actions / total # of objects & actions =1−|R|/(|O|+|A|)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
VC-Inspector … reference-free … factual grounding … instruction tuning on ActivityNet-FG-It
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SPICE: Semantic Propositional Image Caption Evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Cap- tion Evaluation, July 2016. arXiv:1607.08822 [cs]. 2, 6, 7 8
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Ze- sen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Repor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee and Alon Lavie. METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Mea- sures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michi...
work page 2005
-
[4]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 6
work page 2015
-
[5]
CLAIR: Evaluating Image Captions with Large Language Models
David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Dar- rell, and John Canny. CLAIR: Evaluating Image Captions with Large Language Models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Con- ference on Empirical Methods in Natural Language Process- ing, pages 13638–13646, Singapore, Dec. 2023. Association for Computational ...
work page 2023
-
[6]
Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023
David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023. 6, 7
-
[7]
Learning to Evaluate Image Captioning
Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to Evaluate Image Captioning, June 2018. arXiv:1806.06422 [cs]. 3
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[8]
Multi-modal transformer for video retrieval
Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 214–229. Springer, 2020. 1
work page 2020
-
[9]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Mar. 2022. arXiv:2104.08718 [cs]. 3, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[10]
Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013. 2, 5, 7
work page 2013
-
[11]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5
work page 2022
-
[12]
LITA: Language instructed temporal-localization assistant
De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. InECCV, 2024. 1
work page 2024
-
[13]
TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept
Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept. 2019. arXiv:1909.02050 [cs]. 3
-
[14]
Dense-Captioning Events in Videos, May 2017
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos, May 2017. 1, 2, 4, 5, 6
work page 2017
-
[15]
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021. 3
-
[16]
ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT. In Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors,Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 34–39, Online, Nov. 2020. Ass...
work page 2020
-
[17]
Yebin Lee, Imseong Park, and Myungjoo Kang. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024. arXiv:2406.06004 [cs]. 3
-
[18]
arXiv preprint arXiv:2305.17497 , year=
Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A Benchmark for Faithful and Consistent Tex- tual Scene Graph Parsing, June 2023. arXiv:2305.17497 [cs]. 2, 6, 7
-
[19]
ROUGE: A Package for Automatic Evalu- ation of Summaries
Chin-Yew Lin. ROUGE: A Package for Automatic Evalu- ation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 1, 2, 3, 5, 7
work page 2004
-
[20]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 5, 12
work page 2023
-
[21]
Hui Liu and Xiaojun Wan. Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023. 3, 6, 7
-
[22]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation us- ing GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 2
work page 2019
-
[24]
VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
Pranava Madhyastha, Josiah Wang, and Lucia Specia. Vi- fidel: Evaluating the visual fidelity of image descriptions. arXiv preprint arXiv:1907.09340, 2019. 2
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[25]
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb
Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb. 2024. arXiv:2402.17969 [cs]. 3
-
[26]
Bleu: a Method for Automatic Evaluation of Ma- chine Translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Ma- chine Translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meet- ing of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. As- sociation for C...
work page 2002
-
[27]
Jeshmol P.J. and Binsu C. Kovoor. Video question answer- ing: A survey of the state-of-the-art.Journal of Visual Com- munication and Image Representation, 105:104320, 2024. 1 9
work page 2024
-
[28]
Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024. 1
-
[29]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, Feb. 2021. arXiv:2103.00020 [cs]. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[30]
Positive-augmented contrastive learning for image and video captioning evaluation
Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6914–6924, 2023. 2, 3, 6, 7
work page 2023
-
[31]
Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, and Zheng-Jun Zha. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022. arXiv:2111.08919 [cs]. 1, 2, 3, 6, 7, 8
-
[32]
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Ye- ung. G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. InAAAI, Mar. 2025. 3, 4, 6, 7
work page 2025
-
[33]
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Eval- uation, June 2015. arXiv:1411.5726 [cs]. 1, 2, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[34]
Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb
Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sug- iura. Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb. 2024. arXiv:2402.18091 [cs]. 3
-
[35]
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan
-
[36]
arXiv:2307.06942. 6, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. Eval- uating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10967–10982, Singapore, Dec
work page 2023
-
[38]
Association for Computational Linguistics. 8
-
[39]
Improving image captioning evaluation by considering inter references vari- ance
Yanzhi Yi, Hangyu Deng, and Jinglu Hu. Improving image captioning evaluation by considering inter references vari- ance. InProceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 985–994, 2020. 2
work page 2020
-
[40]
HICEScore: A Hierarchical Metric for Image Captioning Evaluation
Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, and Bo Chen. HICEScore: A Hierarchical Metric for Image Captioning Evaluation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 866–875, Oct. 2024. arXiv:2407.18589 [cs]. 3
-
[41]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. Bertscore: Evaluating text gener- ation with bert. InInternational Conference on Learning Representations, 2020. 1, 6, 7
work page 2020
-
[42]
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. BERTScore: Evaluating Text Gen- eration with BERT, Feb. 2020. arXiv:1904.09675 [cs]. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[43]
Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. InAAAI Conference on Artificial Intelligence, 2018. 1, 5 10 Appendix The appendix is organized as follows. Section A lists the detailed prompts used in the proposed data genera- tion pipeline, model fine-tuning, and explanation evalua- tio...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.