VideoSET: Video Summary Evaluation through Text

Alireza Fathi; Li Fei-Fei; Serena Yeung

arxiv: 1406.5824 · v1 · pith:Z44WNKEPnew · submitted 2014-06-23 · cs.CV · cs.CL · cs.IR

Serena Yeung, Alireza Fathi, Li Fei-Fei

classification: cs.CV, cs.CL, cs.IR

keywords: video, text, summary, evaluation, distance, ground-truth, semantic, summaries

In this paper we present VideoSET, a method for Video Summary Evaluation through Text that can evaluate how well a video summary is able to retain the semantic information contained in its original video. We observe that semantics is most easily expressed in words, and develop a text-based approach for the evaluation. Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans. We show that our technique has higher agreement with human judgment than pixel-based distance metrics. We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets
cs.CV 2025-10 conditional novelty 5.0

SD-MVSum extends script-driven video summarization to multimodal inputs by modeling script-video and script-transcript relevance with a new weighted cross-modal attention mechanism, plus extended S-VideoXum and MrHiSu...