SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Charalampia Zerva; Evlampios Apostolidis; Manolis Mylonas; Vasileios Mezaris

arXiv: 2510.05652 · v2 · submitted 2025-10-07 · cs.CV


Pith reviewed 2026-05-18 08:58 UTC · model grok-4.3


keywords: video summarization, multimodal video summarization, script-driven summarization, cross-modal attention, semantic similarity, video datasets


The pith

The SD-MVSum method uses a weighted cross-modal attention mechanism to create video summaries that align with a user script by considering both visual and spoken content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SD-MVSum for script-driven multimodal video summarization. It extends previous work by adding the transcript to the visual content and modeling their separate relations to the script with a new weighted cross-modal attention. This attention uses semantic similarity to highlight the video segments that best fit the script. The authors also enlarge the S-VideoXum and MrHiSum datasets to include the needed multimodal data. Results show the method is competitive with state-of-the-art techniques for both script-based and generic video summarization.

Core claim

SD-MVSum builds on the SD-VSum method by incorporating the audio transcript modality and using a weighted cross-modal attention mechanism that exploits semantic similarity between the script-video pair and the script-transcript pair to promote the parts of the video with highest relevance to the user script. The method and the extended datasets demonstrate competitive performance against other state-of-the-art approaches.

What carries the argument

The weighted cross-modal attention mechanism that exploits semantic similarity between the script and each of the visual and transcript modalities to select relevant video segments.
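As a rough illustration of this idea, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the segment embeddings (visual frames or transcript sentences) and the script sentence embeddings already share a dimension, it omits the learned projections a real model would have, and it uses the maximum cosine similarity to any script sentence as the per-segment weight. The function names and this particular weighting rule are hypothetical; the paper's exact formulation is not reproduced on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_similarity(a, b, eps=1e-8):
    # Pairwise cosine similarity between rows of a (T, d) and rows of b (S, d).
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a_n @ b_n.T

def weighted_cross_attention(segments, script):
    """Hypothetical similarity-weighted cross-attention.

    segments: (T, d) embeddings of one modality (queries).
    script:   (S, d) embeddings of the script sentences (keys and values).
    Returns the weighted (T, d) features and the (T,) relevance weights.
    """
    d = segments.shape[-1]
    # Standard scaled dot-product cross-attention over the script sentences.
    attn = softmax(segments @ script.T / np.sqrt(d), axis=-1)   # (T, S)
    attended = attn @ script                                    # (T, d)
    # Assumed weighting: best cosine match to the script, mapped to [0, 1].
    relevance = cosine_similarity(segments, script).max(axis=-1)
    weights = (relevance + 1.0) / 2.0
    return attended * weights[:, None], weights
```

In a two-branch setup like the one described, the same function would be called once for the script-video pair and once for the script-transcript pair. How the two outputs are fused is left open here.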

If this is right

Video summaries can better reflect user scripts that refer to dialogue.
The approach works across script-driven and generic video summarization tasks.
New extended datasets facilitate development of multimodal summarization methods.
Competitive results are obtained without additional post-processing or tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This mechanism might be adapted for other tasks like video question answering where text queries need to match audio and visuals.
It highlights the value of semantic matching over simple concatenation of modalities.
Testing with diverse, real-world user scripts could show the limits of the similarity-based selection.

Load-bearing premise

The semantic similarity computed by the attention mechanism accurately captures the relevance of video segments to the user script.

What would settle it

An ablation in which the full SD-MVSum is compared against variants without the transcript or without the similarity weighting. If the ablated variants perform equally well or better on the datasets, according to standard metrics or human judgment, the core claim would be undermined.
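A comparison of this kind could be scored with a small harness like the sketch below. It assumes frame-level binary selections and uses an overlap-based F-score. Both the metric and the helper names `f_score` and `compare_variants` are assumptions made for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def f_score(pred, gt):
    # Overlap-based F-score between two binary frame selections.
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    overlap = np.logical_and(pred, gt).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return float(2 * precision * recall / (precision + recall))

def compare_variants(predictions, gt):
    # predictions: dict mapping a variant name to its binary selection.
    # Returns (name, score) pairs sorted from best to worst.
    scores = {name: f_score(p, gt) for name, p in predictions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In practice the scores would be averaged over all test videos, and a rank-correlation measure could be reported alongside the F-score.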

Figures

Figures reproduced from arXiv: 2510.05652 by Charalampia Zerva, Evlampios Apostolidis, Manolis Mylonas, Vasileios Mezaris.

**Figure 1.** Overview of the SD-MVSum network architecture. Given an input video, a user script about the content of the summary, and a set ...

**Figure 2.** The processing pipeline in the weighted cross-modal attention mechanism when fusing the visual and the script embeddings. The ...

**Figure 3.** Overview of the processing pipeline for creating the S-MrHiSum dataset.

**Figure 4.** An indicative sample from our qualitative analysis. The upper part provides a keyframe-based representation of the content of ...

Original abstract

In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. SD-MVSum takes into account, in addition to the visual modality, the relevance of the user-provided script with the spoken content (i.e., audio transcript) of the video. The dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for script-driven (S-VideoXum) and generic (MrHiSum) video summarization, to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of the proposed SD-MVSum method against other SotA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This extends the authors' prior SD-VSum with a weighted cross-modal attention for script-video and script-transcript pairs plus dataset extensions, and reports competitive results on summarization tasks.


The core advance is the shift from visual-only script-driven summarization to a multimodal version that also weighs the transcript. The new weighted cross-modal attention uses semantic similarity to boost script-relevant segments across both pairs, and the authors extend S-VideoXum and MrHiSum to support this setup. They release code and data, which is useful for anyone wanting to reproduce or build on it.

The experimental section claims the method holds up against SotA on both script-driven and generic summarization benchmarks, and the construction looks internally consistent, with no obvious circular dependence on the earlier paper. That said, the abstract leaves the exact metrics, baseline choices, and ablation depth a bit thin, so the competitiveness claim would benefit from clearer statistical backing in the full version. The scope stays practical and focused on media tools rather than claiming wider theoretical shifts.

This paper is mainly for people already working in video summarization or multimodal retrieval who need script-conditioned outputs. A reader in that subfield would find the attention mechanism and the extended datasets worth examining. I would send it to peer review because the contribution is grounded, the code is public, and the evaluation protocol is direct enough for referees to assess properly.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes SD-MVSum, a method for script-driven multimodal video summarization that extends the authors' prior SD-VSum work. SD-MVSum incorporates both visual and transcript modalities by modeling pairwise dependencies (script-video and script-transcript) with a new weighted cross-modal attention mechanism that exploits semantic similarity to promote script-relevant segments. The authors extend the S-VideoXum and MrHiSum datasets to support multimodal script-driven summarization and report that SD-MVSum achieves competitive performance against state-of-the-art methods on both script-driven and generic video summarization tasks. Public code and datasets are released.

Significance. If the reported competitiveness holds under rigorous evaluation, the work advances script-driven video summarization by adding a transcript modality fused via semantically motivated attention. The dataset extensions address a resource gap for multimodal settings, and the public code plus direct SotA comparisons constitute clear strengths that support reproducibility and independent verification. This could inform practical systems for personalized or content-aware video summarization.

minor comments (3)

The abstract states that experimental comparisons document competitiveness, but it would improve immediate clarity to name the primary metrics (e.g., F1 or mAP) and the main baseline families used.
In the method description, the precise formulation of the weighting in the cross-modal attention (how semantic similarity scores are normalized and applied) would benefit from an explicit equation or pseudocode block for reproducibility.
When presenting the extended S-VideoXum and MrHiSum datasets, quantitative statistics on the added transcript coverage and any new annotation protocol should be included to allow readers to judge the scale of the multimodal extension.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work on SD-MVSum, including the recognition of the weighted cross-modal attention mechanism, the dataset extensions, and the public release of code and data. We appreciate the recommendation for minor revision and the acknowledgment of the potential impact on practical systems for script-driven summarization.

Circularity Check

0 steps flagged

Minor self-citation to prior SD-VSum; new attention mechanism and experiments supply independent content


The manuscript explicitly builds on the authors' earlier SD-VSum for the visual-only case but introduces a distinct weighted cross-modal attention mechanism to model script-transcript relevance in addition to script-video. Competitiveness is demonstrated via direct experimental comparisons against SotA methods on the extended S-VideoXum and MrHiSum datasets, with public code released. No load-bearing step reduces a prediction to a fitted input by construction, no self-definitional loop appears in the architecture description, and the self-citation is not invoked to forbid alternatives or to justify the core multimodal claim. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard deep-learning assumptions for attention and multimodal fusion; no explicit free parameters, new axioms, or invented entities are described in the abstract.

axioms (1)

standard math: Standard assumptions in neural network training and attention mechanisms for semantic similarity computation.
The weighted cross-modal attention presupposes typical properties of learned embeddings and similarity metrics in multimodal models.




[1] [1]

YouTube-8M: A Large-Scale Video Classification Benchmark

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Nat- sev, George Toderici, Balakrishnan Varadarajan, and Sud- heendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark.CoRR, abs/1609.08675,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Metsai, Vasileios Mezaris, and Ioannis Patras

Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, and Ioannis Patras. Video sum- marization using deep neural networks: A survey.Proceed- ings of the IEEE, 109(11):1838–1863, 2021. 3

work page 2021

[3] [3]

Combining global and local at- tention with positional encoding for video summarization

Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, and Ioannis Patras. Combining global and local at- tention with positional encoding for video summarization. In 2021 IEEE International Symposium on Multimedia (ISM), pages 226–234, 2021. 6

work page 2021

[4] [4]

Scaling Up Video Summarization Pretraining with Large Language Models

Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heil- bron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, and Joon Son Chung. Scaling Up Video Summarization Pretraining with Large Language Models . In2024 IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 8332–8341, Los Alamitos, CA, USA, 2024. IEEE Computer...

work page 2024

[5] [5]

Lever- aging semantic saliency maps for query-specific video sum- marization.Multimedia Tools Appl., 81(12):17457–17482,

Kemal Cizmeciler, Erkut Erdem, and Aykut Erdem. Lever- aging semantic saliency maps for query-specific video sum- marization.Multimedia Tools Appl., 81(12):17457–17482,

work page

[6] [6]

MM-A VS: A full- scale dataset for multi-modal summarization

Xiyan Fu, Jun Wang, and Zhenglu Yang. MM-A VS: A full- scale dataset for multi-modal summarization. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, pages 5922–5926, Online, 2021. Asso- ciation for Computational Linguistics. 2

work page 2021

[7] [7]

Creating summaries from user videos

Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. Creating summaries from user videos. InComputer Vision – ECCV 2014, pages 505–520, Cham,

work page 2014

[8] [8]

Springer International Publishing. 2

work page

[9] [9]

Align and Attend: Multi- modal Summarization with Dual Contrastive Losses

Bo He, Jun Wang, Jielin Qiu, Trung Bui, Abhinav Shri- vastava, and Zhaowen Wang. Align and Attend: Multi- modal Summarization with Dual Contrastive Losses . In 2023 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 14867–14878, Los Alami- tos, CA, USA, 2023. IEEE Computer Society. 2, 6

work page 2023

[10] [10]

Query-controllable video summarization

Jia-Hong Huang and Marcel Worring. Query-controllable video summarization. InProceedings of the 2020 Interna- tional Conference on Multimedia Retrieval, page 242–250, New York, NY , USA, 2020. Association for Computing Ma- chinery. 1

work page 2020

[11] [11]

Query-based video summarization with pseudo label supervision

Jia-Hong Huang, Luka Murn, Marta Mrak, and Marcel Wor- ring. Query-based video summarization with pseudo label supervision. In2023 IEEE International Conference on Im- age Processing (ICIP), pages 1430–1434, 2023. 2

work page 2023

[12] [12]

Hierarchical variational network for user-diversified & query-focused video summarization

Pin Jiang and Yahong Han. Hierarchical variational network for user-diversified & query-focused video summarization. InProceedings of the 2019 on International Conference on Multimedia Retrieval, page 202–206, New York, NY , USA,

work page 2019

[13] [13]

Association for Computing Machinery. 2

work page

[14] [14]

The treatment of ties in ranking prob- lems.Biometrika, 33(3):239–251, 1945

Maurice G Kendall. The treatment of ties in ranking prob- lems.Biometrika, 33(3):239–251, 1945. 6

work page 1945

[15] [15]

Crc Press,

Stephen Kokoska and Daniel Zwillinger.CRC standard probability and statistics tables and formulae. Crc Press,

work page

[16] [16]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In2017 IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017. 2, 3

work page 2017

[17] [17]

Llava-next: Stronger llms supercharge multimodal capa- bilities in the wild, 2024

Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capa- bilities in the wild, 2024. 5

work page 2024

[18] [18]

Videoxum: Cross- modal visual and textural summarization of videos.IEEE Transactions on Multimedia, 26:5548–5560, 2024

Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, and Jiebo Luo. Videoxum: Cross- modal visual and textural summarization of videos.IEEE Transactions on Multimedia, 26:5548–5560, 2024. 3, 5, 6

work page 2024

[19] [19]

Sd-vsum: A method and dataset for script-driven video summarization, 2025

Manolis Mylonas, Evlampios Apostolidis, and Vasileios Mezaris. Sd-vsum: A method and dataset for script-driven video summarization, 2025. 1, 2, 3, 6, 7

work page 2025

[20] [20]

Clip-it! language-guided video summarization

Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. Clip-it! language-guided video summarization. InProceed- ings of the 35th International Conference on Neural Infor- mation Processing Systems, Red Hook, NY , USA, 2021. Cur- ran Associates Inc. 1, 2, 6

work page 2021

[21] [21]

Tl;dw? summarizing instructional videos with task relevance and cross-modal saliency

Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, and Cordelia Schmid. Tl;dw? summarizing instructional videos with task relevance and cross-modal saliency. InComputer Vision – ECCV 2022, pages 540–557, Cham, 2022. Springer Nature Switzerland. 2

work page 2022

[22] [22]

Rethinking the evaluation of video summaries

Mayu Otani, Yuta Nakashima, Esa Rahtu, and Janne Heikkil¨a. Rethinking the evaluation of video summaries. In 2019 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 7588–7596, 2019. 6

work page 2019

[23] [23]

Multimodal abstractive summarization for how2 videos

Shruti Palaskar, Jind ˇrich Libovick ´y, Spandana Gella, and Florian Metze. Multimodal abstractive summarization for how2 videos. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6587– 6596, Florence, Italy, 2019. Association for Computational Linguistics. 2

work page 2019

[24] [24]

MM- Sum: A Dataset for Multimodal Summarization and Thumb- nail Generation of Videos

Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jian- feng Wang, Ding Zhao, Bo Li, and Lijuan Wang. MM- Sum: A Dataset for Multimodal Summarization and Thumb- nail Generation of Videos . In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21909–21921, Los Alamitos...

work page 2024

[25] [25]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InProceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023. 5

work page 2023

[26] [26]

Hierarchical multimodal attention for deep video summa- rization

Melissa Sanabria, Frédéric Precioso, and Thomas Menguy. Hierarchical multimodal attention for deep video summarization. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7977–7984, 2021. 2

[27]

Query-focused extractive video summarization

Aidean Sharghi, Boqing Gong, and Mubarak Shah. Query-focused extractive video summarization. In Computer Vision – ECCV 2016, pages 3–19, Cham, 2016. Springer International Publishing. 1, 2

[28]

Query-focused video summarization: Dataset, evaluation, and a memory network based approach

Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2127–2136, 2017. 1, 2

[29]

CSTA: CNN-based Spatiotemporal Attention for Video Summarization

Jaewon Son, Jaehun Park, and Kwangsu Kim. CSTA: CNN-based Spatiotemporal Attention for Video Summarization. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18847–18856, Los Alamitos, CA, USA, 2024. IEEE Computer Society. 6

[30]

Tvsum: Summarizing web videos using titles

Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. Tvsum: Summarizing web videos using titles. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5179–5187, 2015. 2

[31]

Jinhwan Sul, Jihoon Han, and Joonseok Lee. Mr. hisum: a large-scale dataset for video highlight detection and summarization. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2023. Curran Associates Inc. 3, 5, 6

[32]

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, D...

[33]

Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier

Silero Team. Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier. https://github.com/snakers4/silero-vad, 2024. 5

[34]

Query-adaptive video summarization via quality-aware relevance estimation

Arun Balajee Vasudevan, Michael Gygli, Anna Volokitin, and Luc Van Gool. Query-adaptive video summarization via quality-aware relevance estimation. In Proceedings of the 25th ACM International Conference on Multimedia, pages 582–590, New York, NY, USA, 2017. Association for Computing Machinery. 1, 2

[35]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 5

[36]

Video summarization via semantic attended networks

Huawei Wei, Bingbing Ni, Yichao Yan, Huanyu Yu, and Xiaokang Yang. Video summarization via semantic attended networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence...

[37]

Query-biased self-attentive network for query-focused video summarization

Shuwen Xiao, Zhou Zhao, Zijian Zhang, Ziyu Guan, and Deng Cai. Query-biased self-attentive network for query-focused video summarization. IEEE Transactions on Image Processing, 29:5889–5899, 2020. 2

[38]

Convolutional hierarchical attention network for query-focused video summarization

Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, and Min Yang. Convolutional hierarchical attention network for query-focused video summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12426–12433, 2020. 2

[39]

VideoSET: Video Summary Evaluation through Text

Serena Yeung, Alireza Fathi, and Li Fei-Fei. Videoset: Video summary evaluation through text. ArXiv, abs/1406.5824,

[40]

Query-Conditioned Three-Player Adversarial Network for Video Summarization

Yujia Zhang, Michael C. Kampffmeyer, Xiaodan Liang, Min Tan, and Eric P. Xing. Query-Conditioned Three-Player Adversarial Network for Video Summarization. In Proceedings of the 2018 British Machine Vision Conf. (BMVC), 2018. 2

[41]

Deep semantic and attentive network for unsupervised video summarization

Sheng-Hua Zhong, Jingxu Lin, Jianglin Lu, Ahmed Fares, and Tongwei Ren. Deep semantic and attentive network for unsupervised video summarization. ACM Trans. Multimedia Comput. Commun. Appl., 18(2), 2022. 2
