VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Shubhashis Roy Dipta; Subarna Tripathi; Tz-Ying Wu

arxiv: 2509.16538 · v3 · submitted 2025-09-20 · 💻 cs.CV · cs.CL

VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Shubhashis Roy Dipta , Tz-Ying Wu , Subarna Tripathi This is my paper

Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords video caption evaluationfactual accuracyreference-free metriclarge multimodal modelsynthetic error injectionhuman judgment correlation

0 comments

The pith

VC-Inspector evaluates video captions for factual accuracy without reference texts by learning from synthetic errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops VC-Inspector, a lightweight open-source multimodal model that assesses the factual correctness of video descriptions without needing a correct reference caption. It creates training examples by injecting specific factual mistakes into captions and provides the model with quality scores and explanations for those errors. This setup allows the model to produce evaluations that match human opinions more closely than earlier methods across several test sets. Readers would value this because reliable automatic checks can help researchers build better video captioning systems that avoid factual mistakes.

Core claim

VC-Inspector is a large multimodal model trained to perform reference-free evaluation of video captions with a focus on factual accuracy. Using a systematic method to generate captions containing controllable factual errors along with graded quality scores and explanatory annotations, the model learns to deliver assessments that show state-of-the-art correlation with human judgments on benchmarks including VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF.

What carries the argument

The systematic framework for generating video captions with controllable factual errors, paired with graded quality scores and explanatory annotations, which enables supervised training of the evaluation model.

If this is right

Delivers reproducible evaluations that do not depend on proprietary language models or services.
Identifies specific factual issues in captions to guide improvements in video description models.
Maintains strong performance when applied to caption datasets from varied sources and domains.
Supplies explanatory annotations alongside scores for greater interpretability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar synthetic error injection could improve evaluation for other generative tasks such as text summarization.
Using the model as a critic during training of caption generators might reduce factual hallucinations in produced video descriptions.
Extending the approach to longer video sequences could test its limits in handling extended context.

Load-bearing premise

The synthetic captions with injected factual errors closely resemble the factual mistakes produced by real video captioning models.

What would settle it

Collecting human judgments on factual accuracy for captions generated by current video models and measuring if VC-Inspector's correlation with those judgments falls substantially below the reported levels on the existing benchmarks.

Figures

Figures reproduced from arXiv: 2509.16538 by Shubhashis Roy Dipta, Subarna Tripathi, Tz-Ying Wu.

**Figure 1.** Figure 1: Existing reference-free metrics like EMScore [31] often fail to detect factual inaccuracies and lack a consistent scoring scale. VC-Inspector addresses these limitations by providing factually grounded, interpretable evaluations with explanations. tent retrieval [8]. To ensure the quality and reliability of captions, robust evaluation metrics are vital for advancing the field of video-language researc… view at source ↗

**Figure 2.** Figure 2: (left) We present a data generation pipeline designed to [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Data generation pipeline to create a synthetic dataset for [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Visual examples from ActivityNet-FG-Eval (top) and VATEX-Eval (others). VC-Inspector produces quality assessments highly consistent with ground truth scores, along with explanatory insights into factual errors (highlighted in red). captions. We leverage these explanations as an auxiliary supervision signal during training, enabling the model to implicitly learn factual grounding, which has been shown to… view at source ↗

**Figure 5.** Figure 5: Additional visual examples on VATEX-Eval. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Additional visualization of VC-Inspector results on ActivityNet-FG-Eval videos, with candidate captions of diverse quality. Incorrect objects and actions are identified by VC-Inspector and labeled in red. In [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. Project page is available at https://dipta007.github.io/VC-Inspector

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VC-Inspector pairs a lightweight LMM with controllable synthetic factual-error injection for video captions, which is a practical addition, but the SOTA correlation claims rest on details not visible in the abstract.

read the letter

The one or two things to know: this paper builds a lightweight multimodal model for checking factual accuracy in video captions without references, using a new framework to create training examples with controlled factual mistakes. The controllable error injection plus graded scores and explanations is the concrete new piece here. It does well by focusing on factuality rather than just fluency and by releasing the model and generation code openly, which supports reproducibility better than closed services. That setup could be picked up by people who need reference-free checks for video understanding tasks. The soft spots are around validation. The synthetic errors have to resemble the actual mistakes real captioners make, such as temporal drift or object hallucinations, or the model will overfit to artificial patterns and fail to generalize. The abstract asserts strong human correlation on VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF but gives no numbers, ablations, or error bars, so the soundness is hard to judge from what is shown. More direct comparison to existing reference-free metrics would also help. This paper is for researchers working on multimodal evaluation or video caption improvement who want a fact-aware tool they can run locally. A reader who cares about practical metrics for downstream uses like search or accessibility could extract value from the data-generation approach even if they adapt the model. It deserves serious peer review because the core idea is concrete and the open release makes verification straightforward, though the experiments will need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VC-Inspector, a lightweight open-source large multimodal model (LMM) for reference-free evaluation of video captions with emphasis on factual accuracy. It introduces a systematic framework for generating synthetic captions containing controllable factual errors, accompanied by graded quality scores and explanatory annotations. The model is trained on this data and evaluated on multiple benchmarks, claiming state-of-the-art correlation with human judgments along with generalization across domains such as VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, while also indicating utility for caption improvement.

Significance. If the central claims are substantiated, this work would supply a reproducible, open-source alternative to reference-dependent or proprietary metrics for assessing factual correctness in video captions. The synthetic generation framework with graded annotations and explanations represents a constructive contribution for interpretable training of factuality evaluators. These elements could support more reliable video understanding pipelines without reliance on external services.

major comments (2)

[§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.
[§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.

minor comments (2)

[Abstract] Abstract: The claim of 'state-of-the-art correlation' is stated without any numerical values; including the key Pearson or Spearman coefficients would strengthen the summary.
[Throughout] Notation and terminology: Acronyms such as LMM, VATEX-Eval, and Flickr8K-CF should be defined on first use in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to strengthen the experimental reporting and validation of the synthetic framework.

read point-by-point responses

Referee: [§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.

Authors: We thank the referee for this observation. While correlation results appear in tables in the original §4, we agree that explicit numerical values, ablations, and error bars were not presented with sufficient prominence or detail in the text. In the revised manuscript we have expanded §4 with a new table that reports exact Pearson and Spearman correlations for VC-Inspector and all baselines on VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF. We have also added ablation studies on the contribution of each factual-error category and on model components, together with standard deviations computed over five independent training runs to supply error-bar information. These additions directly support verification of the reported superiority and robustness. revision: yes
Referee: [§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.

Authors: We acknowledge the validity of this concern. The synthetic framework was constructed from error categories repeatedly documented in the video-captioning literature, yet we did not originally include a side-by-side distributional comparison with errors produced by deployed models. To address the point, the revised §3 now contains a new validation subsection. We generated captions with several open-source video captioners on a held-out video set, had the factual errors annotated by the same protocol used for the synthetic data, and report the resulting type distributions alongside those of the synthetic captions. The comparison shows substantial overlap while also noting residual differences, which we discuss as a limitation and direction for future refinement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation is self-contained against external benchmarks

full rationale

The paper introduces VC-Inspector as a new LMM trained on a custom synthetic caption dataset with injected factual errors, then measures its performance via correlation with human judgments on independent external benchmarks (VATEX-Eval, Flickr8K-Expert, Flickr8K-CF). No equations, derivations, or fitted parameters are shown that reduce the reported SOTA correlations to quantities defined inside the paper itself. The central empirical claim rests on training followed by testing against held-out human labels rather than any self-referential construction, self-citation chain, or ansatz smuggled via prior work. This matches the default case of a self-contained model trained and evaluated on external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that human judgments constitute the correct ground truth for factual accuracy and that the synthetic error-injection process produces a representative distribution of real-world caption mistakes.

axioms (2)

domain assumption Human judgments are the appropriate gold standard for factual correctness of video captions.
Invoked when the paper states that VC-Inspector aligns closely with human judgments.
domain assumption Captions generated by deliberately injecting controllable factual errors are representative of errors made by real captioning models.
Required for the training framework to transfer to real evaluation scenarios.

pith-pipeline@v0.9.0 · 5686 in / 1237 out tokens · 31224 ms · 2026-05-18T15:25:56.478035+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

score=1−# of changed objects & actions / total # of objects & actions =1−|R|/(|O|+|A|)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

VC-Inspector … reference-free … factual grounding … instruction tuning on ActivityNet-FG-It

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 10 internal anchors

[1]

SPICE: Semantic Propositional Image Caption Evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Cap- tion Evaluation, July 2016. arXiv:1607.08822 [cs]. 2, 6, 7 8

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Ze- sen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Mea- sures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michi...

work page 2005
[4]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 6

work page 2015
[5]

CLAIR: Evaluating Image Captions with Large Language Models

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Dar- rell, and John Canny. CLAIR: Evaluating Image Captions with Large Language Models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Con- ference on Empirical Methods in Natural Language Process- ing, pages 13638–13646, Singapore, Dec. 2023. Association for Computational ...

work page 2023
[6]

Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023

David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023. 6, 7

work page arXiv 2023
[7]

Learning to Evaluate Image Captioning

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to Evaluate Image Captioning, June 2018. arXiv:1806.06422 [cs]. 3

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Multi-modal transformer for video retrieval

Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 214–229. Springer, 2020. 1

work page 2020
[9]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Mar. 2022. arXiv:2104.08718 [cs]. 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013

Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013. 2, 5, 7

work page 2013
[11]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5

work page 2022
[12]

LITA: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. InECCV, 2024. 1

work page 2024
[13]

TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept

Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept. 2019. arXiv:1909.02050 [cs]. 3

work page arXiv 2019
[14]

Dense-Captioning Events in Videos, May 2017

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos, May 2017. 1, 2, 4, 5, 6

work page 2017
[15]

Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021. 3

work page arXiv 2021
[16]

ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT. In Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors,Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 34–39, Online, Nov. 2020. Ass...

work page 2020
[17]

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024

Yebin Lee, Imseong Park, and Myungjoo Kang. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024. arXiv:2406.06004 [cs]. 3

work page arXiv 2024
[18]

arXiv preprint arXiv:2305.17497 , year=

Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A Benchmark for Faithful and Consistent Tex- tual Scene Graph Parsing, June 2023. arXiv:2305.17497 [cs]. 2, 6, 7

work page arXiv 2023
[19]

ROUGE: A Package for Automatic Evalu- ation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evalu- ation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 1, 2, 3, 5, 7

work page 2004
[20]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 5, 12

work page 2023
[21]

Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023

Hui Liu and Xiaojun Wan. Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023. 3, 6, 7

work page arXiv 2023
[22]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation us- ing GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 2

work page 2019
[24]

VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions

Pranava Madhyastha, Josiah Wang, and Lucia Specia. Vi- fidel: Evaluating the visual fidelity of image descriptions. arXiv preprint arXiv:1907.09340, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1907
[25]

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb. 2024. arXiv:2402.17969 [cs]. 3

work page arXiv 2024
[26]

Bleu: a Method for Automatic Evaluation of Ma- chine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Ma- chine Translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meet- ing of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. As- sociation for C...

work page 2002
[27]

and Binsu C

Jeshmol P.J. and Binsu C. Kovoor. Video question answer- ing: A survey of the state-of-the-art.Journal of Visual Com- munication and Image Representation, 105:104320, 2024. 1 9

work page 2024
[28]

Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024. 1

work page arXiv 2024
[29]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, Feb. 2021. arXiv:2103.00020 [cs]. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Positive-augmented contrastive learning for image and video captioning evaluation

Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6914–6924, 2023. 2, 3, 6, 7

work page 2023
[31]

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022

Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, and Zheng-Jun Zha. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022. arXiv:2111.08919 [cs]. 1, 2, 3, 6, 7, 8

work page arXiv 2022
[32]

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o

Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Ye- ung. G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. InAAAI, Mar. 2025. 3, 4, 6, 7

work page 2025
[33]

CIDEr: Consensus-based Image Description Evaluation

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Eval- uation, June 2015. arXiv:1411.5726 [cs]. 1, 2, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2015
[34]

Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb

Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sug- iura. Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb. 2024. arXiv:2402.18091 [cs]. 3

work page arXiv 2024
[35]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

work page
[36]

arXiv:2307.06942. 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Vera Liao

Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. Eval- uating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10967–10982, Singapore, Dec

work page 2023
[38]

Association for Computational Linguistics. 8

work page
[39]

Improving image captioning evaluation by considering inter references vari- ance

Yanzhi Yi, Hangyu Deng, and Jinglu Hu. Improving image captioning evaluation by considering inter references vari- ance. InProceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 985–994, 2020. 2

work page 2020
[40]

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, and Bo Chen. HICEScore: A Hierarchical Metric for Image Captioning Evaluation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 866–875, Oct. 2024. arXiv:2407.18589 [cs]. 3

work page arXiv 2024
[41]

Wein- berger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. Bertscore: Evaluating text gener- ation with bert. InInternational Conference on Learning Representations, 2020. 1, 6, 7

work page 2020
[42]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. BERTScore: Evaluating Text Gen- eration with BERT, Feb. 2020. arXiv:1904.09675 [cs]. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2020
[43]

Object1",

Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. InAAAI Conference on Artificial Intelligence, 2018. 1, 5 10 Appendix The appendix is organized as follows. Section A lists the detailed prompts used in the proposed data genera- tion pipeline, model fine-tuning, and explanation evalua- tio...

work page 2018

[1] [1]

SPICE: Semantic Propositional Image Caption Evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Cap- tion Evaluation, July 2016. arXiv:1607.08822 [cs]. 2, 6, 7 8

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Ze- sen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Mea- sures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michi...

work page 2005

[4] [4]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 6

work page 2015

[5] [5]

CLAIR: Evaluating Image Captions with Large Language Models

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Dar- rell, and John Canny. CLAIR: Evaluating Image Captions with Large Language Models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Con- ference on Empirical Methods in Natural Language Process- ing, pages 13638–13646, Singapore, Dec. 2023. Association for Computational ...

work page 2023

[6] [6]

Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023

David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023. 6, 7

work page arXiv 2023

[7] [7]

Learning to Evaluate Image Captioning

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to Evaluate Image Captioning, June 2018. arXiv:1806.06422 [cs]. 3

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Multi-modal transformer for video retrieval

Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 214–229. Springer, 2020. 1

work page 2020

[9] [9]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Mar. 2022. arXiv:2104.08718 [cs]. 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013

Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013. 2, 5, 7

work page 2013

[11] [11]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5

work page 2022

[12] [12]

LITA: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. InECCV, 2024. 1

work page 2024

[13] [13]

TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept

Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept. 2019. arXiv:1909.02050 [cs]. 3

work page arXiv 2019

[14] [14]

Dense-Captioning Events in Videos, May 2017

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos, May 2017. 1, 2, 4, 5, 6

work page 2017

[15] [15]

Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021. 3

work page arXiv 2021

[16] [16]

ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT. In Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors,Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 34–39, Online, Nov. 2020. Ass...

work page 2020

[17] [17]

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024

Yebin Lee, Imseong Park, and Myungjoo Kang. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024. arXiv:2406.06004 [cs]. 3

work page arXiv 2024

[18] [18]

arXiv preprint arXiv:2305.17497 , year=

Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A Benchmark for Faithful and Consistent Tex- tual Scene Graph Parsing, June 2023. arXiv:2305.17497 [cs]. 2, 6, 7

work page arXiv 2023

[19] [19]

ROUGE: A Package for Automatic Evalu- ation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evalu- ation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 1, 2, 3, 5, 7

work page 2004

[20] [20]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 5, 12

work page 2023

[21] [21]

Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023

Hui Liu and Xiaojun Wan. Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023. 3, 6, 7

work page arXiv 2023

[22] [22]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation us- ing GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 2

work page 2019

[24] [24]

VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions

Pranava Madhyastha, Josiah Wang, and Lucia Specia. Vi- fidel: Evaluating the visual fidelity of image descriptions. arXiv preprint arXiv:1907.09340, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1907

[25] [25]

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb. 2024. arXiv:2402.17969 [cs]. 3

work page arXiv 2024

[26] [26]

Bleu: a Method for Automatic Evaluation of Ma- chine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Ma- chine Translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meet- ing of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. As- sociation for C...

work page 2002

[27] [27]

and Binsu C

Jeshmol P.J. and Binsu C. Kovoor. Video question answer- ing: A survey of the state-of-the-art.Journal of Visual Com- munication and Image Representation, 105:104320, 2024. 1 9

work page 2024

[28] [28]

Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024. 1

work page arXiv 2024

[29] [29]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, Feb. 2021. arXiv:2103.00020 [cs]. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[30] [30]

Positive-augmented contrastive learning for image and video captioning evaluation

Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6914–6924, 2023. 2, 3, 6, 7

work page 2023

[31] [31]

EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022

Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, and Zheng-Jun Zha. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022. arXiv:2111.08919 [cs]. 1, 2, 3, 6, 7, 8

work page arXiv 2022

[32] [32]

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o

Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Ye- ung. G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. InAAAI, Mar. 2025. 3, 4, 6, 7

work page 2025

[33] [33]

CIDEr: Consensus-based Image Description Evaluation

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Eval- uation, June 2015. arXiv:1411.5726 [cs]. 1, 2, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2015

[34] [34]

Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb

Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sug- iura. Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb. 2024. arXiv:2402.18091 [cs]. 3

work page arXiv 2024

[35] [35]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

work page

[36] [36]

arXiv:2307.06942. 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Vera Liao

Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. Eval- uating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10967–10982, Singapore, Dec

work page 2023

[38] [38]

Association for Computational Linguistics. 8

work page

[39] [39]

Improving image captioning evaluation by considering inter references vari- ance

Yanzhi Yi, Hangyu Deng, and Jinglu Hu. Improving image captioning evaluation by considering inter references vari- ance. InProceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 985–994, 2020. 2

work page 2020

[40] [40]

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, and Bo Chen. HICEScore: A Hierarchical Metric for Image Captioning Evaluation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 866–875, Oct. 2024. arXiv:2407.18589 [cs]. 3

work page arXiv 2024

[41] [41]

Wein- berger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. Bertscore: Evaluating text gener- ation with bert. InInternational Conference on Learning Representations, 2020. 1, 6, 7

work page 2020

[42] [42]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. BERTScore: Evaluating Text Gen- eration with BERT, Feb. 2020. arXiv:1904.09675 [cs]. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2020

[43] [43]

Object1",

Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. InAAAI Conference on Artificial Intelligence, 2018. 1, 5 10 Appendix The appendix is organized as follows. Section A lists the detailed prompts used in the proposed data genera- tion pipeline, model fine-tuning, and explanation evalua- tio...

work page 2018