pith. sign in

arxiv: 2509.16538 · v3 · submitted 2025-09-20 · 💻 cs.CV · cs.CL

VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords video caption evaluationfactual accuracyreference-free metriclarge multimodal modelsynthetic error injectionhuman judgment correlation
0
0 comments X

The pith

VC-Inspector evaluates video captions for factual accuracy without reference texts by learning from synthetic errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops VC-Inspector, a lightweight open-source multimodal model that assesses the factual correctness of video descriptions without needing a correct reference caption. It creates training examples by injecting specific factual mistakes into captions and provides the model with quality scores and explanations for those errors. This setup allows the model to produce evaluations that match human opinions more closely than earlier methods across several test sets. Readers would value this because reliable automatic checks can help researchers build better video captioning systems that avoid factual mistakes.

Core claim

VC-Inspector is a large multimodal model trained to perform reference-free evaluation of video captions with a focus on factual accuracy. Using a systematic method to generate captions containing controllable factual errors along with graded quality scores and explanatory annotations, the model learns to deliver assessments that show state-of-the-art correlation with human judgments on benchmarks including VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF.

What carries the argument

The systematic framework for generating video captions with controllable factual errors, paired with graded quality scores and explanatory annotations, which enables supervised training of the evaluation model.

If this is right

  • Delivers reproducible evaluations that do not depend on proprietary language models or services.
  • Identifies specific factual issues in captions to guide improvements in video description models.
  • Maintains strong performance when applied to caption datasets from varied sources and domains.
  • Supplies explanatory annotations alongside scores for greater interpretability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar synthetic error injection could improve evaluation for other generative tasks such as text summarization.
  • Using the model as a critic during training of caption generators might reduce factual hallucinations in produced video descriptions.
  • Extending the approach to longer video sequences could test its limits in handling extended context.

Load-bearing premise

The synthetic captions with injected factual errors closely resemble the factual mistakes produced by real video captioning models.

What would settle it

Collecting human judgments on factual accuracy for captions generated by current video models and measuring if VC-Inspector's correlation with those judgments falls substantially below the reported levels on the existing benchmarks.

Figures

Figures reproduced from arXiv: 2509.16538 by Shubhashis Roy Dipta, Subarna Tripathi, Tz-Ying Wu.

Figure 1
Figure 1. Figure 1: Existing reference-free metrics like EMScore [31] of￾ten fail to detect factual inaccuracies and lack a consistent scor￾ing scale. VC-Inspector addresses these limitations by pro￾viding factually grounded, interpretable evaluations with expla￾nations. tent retrieval [8]. To ensure the quality and reliability of captions, robust evaluation metrics are vital for advanc￾ing the field of video-language researc… view at source ↗
Figure 2
Figure 2. Figure 2: (left) We present a data generation pipeline designed to [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Data generation pipeline to create a synthetic dataset for [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visual examples from ActivityNet-FG-Eval (top) and VATEX-Eval (others). VC-Inspector produces quality as￾sessments highly consistent with ground truth scores, along with explanatory insights into factual errors (highlighted in red). captions. We leverage these explanations as an auxiliary su￾pervision signal during training, enabling the model to im￾plicitly learn factual grounding, which has been shown to… view at source ↗
Figure 5
Figure 5. Figure 5: Additional visual examples on VATEX-Eval. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional visualization of VC-Inspector results on ActivityNet-FG-Eval videos, with candidate captions of diverse quality. Incorrect objects and actions are identified by VC-Inspector and labeled in red. In [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
read the original abstract

We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible and fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement. Project page is available at https://dipta007.github.io/VC-Inspector

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VC-Inspector, a lightweight open-source large multimodal model (LMM) for reference-free evaluation of video captions with emphasis on factual accuracy. It introduces a systematic framework for generating synthetic captions containing controllable factual errors, accompanied by graded quality scores and explanatory annotations. The model is trained on this data and evaluated on multiple benchmarks, claiming state-of-the-art correlation with human judgments along with generalization across domains such as VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, while also indicating utility for caption improvement.

Significance. If the central claims are substantiated, this work would supply a reproducible, open-source alternative to reference-dependent or proprietary metrics for assessing factual correctness in video captions. The synthetic generation framework with graded annotations and explanations represents a constructive contribution for interpretable training of factuality evaluators. These elements could support more reliable video understanding pipelines without reliance on external services.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.
  2. [§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.
minor comments (2)
  1. [Abstract] Abstract: The claim of 'state-of-the-art correlation' is stated without any numerical values; including the key Pearson or Spearman coefficients would strengthen the summary.
  2. [Throughout] Notation and terminology: Acronyms such as LMM, VATEX-Eval, and Flickr8K-CF should be defined on first use in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to strengthen the experimental reporting and validation of the synthetic framework.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments): The manuscript asserts state-of-the-art correlation with human judgments and generalization across VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF, yet provides no specific quantitative correlation values, ablation studies, or error-bar information. This absence makes it impossible to verify the superiority and robustness claims that form the core experimental contribution.

    Authors: We thank the referee for this observation. While correlation results appear in tables in the original §4, we agree that explicit numerical values, ablations, and error bars were not presented with sufficient prominence or detail in the text. In the revised manuscript we have expanded §4 with a new table that reports exact Pearson and Spearman correlations for VC-Inspector and all baselines on VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF. We have also added ablation studies on the contribution of each factual-error category and on model components, together with standard deviations computed over five independent training runs to supply error-bar information. These additions directly support verification of the reported superiority and robustness. revision: yes

  2. Referee: [§3 (Synthetic Caption Generation Framework)] §3 (Synthetic Caption Generation Framework): The central generalization claim rests on the assumption that the injected factual errors (temporal drift, object hallucination, action mislabeling) reproduce the distribution of mistakes made by real video captioning models. No validation experiment comparing synthetic error patterns against outputs from deployed captioners is reported; without this, high benchmark scores may reflect overfitting to artificial templates rather than authentic failure modes.

    Authors: We acknowledge the validity of this concern. The synthetic framework was constructed from error categories repeatedly documented in the video-captioning literature, yet we did not originally include a side-by-side distributional comparison with errors produced by deployed models. To address the point, the revised §3 now contains a new validation subsection. We generated captions with several open-source video captioners on a held-out video set, had the factual errors annotated by the same protocol used for the synthetic data, and report the resulting type distributions alongside those of the synthetic captions. The comparison shows substantial overlap while also noting residual differences, which we discuss as a limitation and direction for future refinement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation is self-contained against external benchmarks

full rationale

The paper introduces VC-Inspector as a new LMM trained on a custom synthetic caption dataset with injected factual errors, then measures its performance via correlation with human judgments on independent external benchmarks (VATEX-Eval, Flickr8K-Expert, Flickr8K-CF). No equations, derivations, or fitted parameters are shown that reduce the reported SOTA correlations to quantities defined inside the paper itself. The central empirical claim rests on training followed by testing against held-out human labels rather than any self-referential construction, self-citation chain, or ansatz smuggled via prior work. This matches the default case of a self-contained model trained and evaluated on external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that human judgments constitute the correct ground truth for factual accuracy and that the synthetic error-injection process produces a representative distribution of real-world caption mistakes.

axioms (2)
  • domain assumption Human judgments are the appropriate gold standard for factual correctness of video captions.
    Invoked when the paper states that VC-Inspector aligns closely with human judgments.
  • domain assumption Captions generated by deliberately injecting controllable factual errors are representative of errors made by real captioning models.
    Required for the training framework to transfer to real evaluation scenarios.

pith-pipeline@v0.9.0 · 5686 in / 1237 out tokens · 31224 ms · 2026-05-18T15:25:56.478035+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 10 internal anchors

  1. [1]

    SPICE: Semantic Propositional Image Caption Evaluation

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic Propositional Image Cap- tion Evaluation, July 2016. arXiv:1607.08822 [cs]. 2, 6, 7 8

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Ze- sen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Repor...

  3. [3]

    METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An Auto- matic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Mea- sures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michi...

  4. [4]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 6

  5. [5]

    CLAIR: Evaluating Image Captions with Large Language Models

    David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Dar- rell, and John Canny. CLAIR: Evaluating Image Captions with Large Language Models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Con- ference on Empirical Methods in Natural Language Process- ing, pages 13638–13646, Singapore, Dec. 2023. Association for Computational ...

  6. [6]

    Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023

    David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models.arXiv preprint arXiv:2310.12971, 2023. 6, 7

  7. [7]

    Learning to Evaluate Image Captioning

    Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. Learning to Evaluate Image Captioning, June 2018. arXiv:1806.06422 [cs]. 3

  8. [8]

    Multi-modal transformer for video retrieval

    Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 214–229. Springer, 2020. 1

  9. [9]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning, Mar. 2022. arXiv:2104.08718 [cs]. 3, 6, 7

  10. [10]

    Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013

    Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 47:853–899, 2013. 2, 5, 7

  11. [11]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5

  12. [12]

    LITA: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. InECCV, 2024. 1

  13. [13]

    TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept

    Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. TIGEr: Text-to-Image Grounding for Image Caption Evalu- ation, Sept. 2019. arXiv:1909.02050 [cs]. 3

  14. [14]

    Dense-Captioning Events in Videos, May 2017

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos, May 2017. 1, 2, 4, 5, 6

  15. [15]

    Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021

    Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. Umic: An unreferenced metric for image captioning via contrastive learning.arXiv preprint arXiv:2106.14019, 2021. 3

  16. [16]

    ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT

    Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. ViL- BERTScore: Evaluating Image Caption Using Vision-and- Language BERT. In Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors,Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 34–39, Online, Nov. 2020. Ass...

  17. [17]

    FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024

    Yebin Lee, Imseong Park, and Myungjoo Kang. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model, June 2024. arXiv:2406.06004 [cs]. 3

  18. [18]

    arXiv preprint arXiv:2305.17497 , year=

    Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gho- lamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A Benchmark for Faithful and Consistent Tex- tual Scene Graph Parsing, June 2023. arXiv:2305.17497 [cs]. 2, 6, 7

  19. [19]

    ROUGE: A Package for Automatic Evalu- ation of Summaries

    Chin-Yew Lin. ROUGE: A Package for Automatic Evalu- ation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. 1, 2, 3, 5, 7

  20. [20]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 5, 12

  21. [21]

    Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023

    Hui Liu and Xiaojun Wan. Models see hallucinations: Eval- uating the factuality in video captioning.arXiv preprint arXiv:2303.02961, 2023. 3, 6, 7

  22. [22]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation us- ing GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]. 3

  23. [23]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.Advances in neural information processing systems, 32, 2019. 2

  24. [24]

    VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions

    Pranava Madhyastha, Josiah Wang, and Lucia Specia. Vi- fidel: Evaluating the visual fidelity of image descriptions. arXiv preprint arXiv:1907.09340, 2019. 2

  25. [25]

    Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb

    Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction, Feb. 2024. arXiv:2402.17969 [cs]. 3

  26. [26]

    Bleu: a Method for Automatic Evaluation of Ma- chine Translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Ma- chine Translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meet- ing of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. As- sociation for C...

  27. [27]

    and Binsu C

    Jeshmol P.J. and Binsu C. Kovoor. Video question answer- ing: A survey of the state-of-the-art.Journal of Visual Com- munication and Image Representation, 105:104320, 2024. 1 9

  28. [28]

    Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024

    Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat- Seng Chua, Yueting Zhuang, and Siliang Tang. Momen- tor: Advancing video large language model with fine-grained temporal reasoning.ArXiv, abs/2402.11435, 2024. 1

  29. [29]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, Feb. 2021. arXiv:2103.00020 [cs]. 2, 3

  30. [30]

    Positive-augmented contrastive learning for image and video captioning evaluation

    Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6914–6924, 2023. 2, 3, 6, 7

  31. [31]

    EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022

    Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, and Zheng-Jun Zha. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Em- bedding Matching, July 2022. arXiv:2111.08919 [cs]. 1, 2, 3, 6, 7, 8

  32. [32]

    G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o

    Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Ye- ung. G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. InAAAI, Mar. 2025. 3, 4, 6, 7

  33. [33]

    CIDEr: Consensus-based Image Description Evaluation

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Eval- uation, June 2015. arXiv:1411.5726 [cs]. 1, 2, 5, 7

  34. [34]

    Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb

    Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sug- iura. Polos: Multimodal Metric Learning from Human Feed- back for Image Captioning, Feb. 2024. arXiv:2402.18091 [cs]. 3

  35. [35]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, Jan

  36. [36]

    arXiv:2307.06942. 6, 7, 8

  37. [37]

    Vera Liao

    Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. Eval- uating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10967–10982, Singapore, Dec

  38. [38]

    Association for Computational Linguistics. 8

  39. [39]

    Improving image captioning evaluation by considering inter references vari- ance

    Yanzhi Yi, Hangyu Deng, and Jinglu Hu. Improving image captioning evaluation by considering inter references vari- ance. InProceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 985–994, 2020. 2

  40. [40]

    HICEScore: A Hierarchical Metric for Image Captioning Evaluation

    Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, and Bo Chen. HICEScore: A Hierarchical Metric for Image Captioning Evaluation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 866–875, Oct. 2024. arXiv:2407.18589 [cs]. 3

  41. [41]

    Wein- berger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. Bertscore: Evaluating text gener- ation with bert. InInternational Conference on Learning Representations, 2020. 1, 6, 7

  42. [42]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein- berger, and Yoav Artzi. BERTScore: Evaluating Text Gen- eration with BERT, Feb. 2020. arXiv:1904.09675 [cs]. 2, 3

  43. [43]

    Object1",

    Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. InAAAI Conference on Artificial Intelligence, 2018. 1, 5 10 Appendix The appendix is organized as follows. Section A lists the detailed prompts used in the proposed data genera- tion pipeline, model fine-tuning, and explanation evalua- tio...