pith. machine review for the scientific record.

arxiv: 2605.03276 · v2 · submitted 2026-05-05 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video editing benchmark · large multimodal models · editing techniques · multimodal reasoning · operation simulation · video understanding · cinematic techniques

The pith

A new benchmark reveals large gaps in how well large multimodal models handle real-world video editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces VEBench to test large multimodal models on real video editing skills. The benchmark uses over 3,900 edited videos and more than 3,000 question-answer pairs to probe two skills: recognizing seven different editing techniques, and simulating edits by picking the right clips at the right times from several candidates. Tests on many models show they perform far worse than people, which matters because it pinpoints what these models still need to learn before they can help with actual video production work. The results suggest models must connect basic video understanding to the practical decision-making that editing tasks demand.

Core claim

VEBENCH is introduced as the first benchmark to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. It includes 3.9K high-quality edited videos totaling over 257 hours and 3,080 human-verified QA pairs constructed via a three-round human-AI collaborative pipeline. The two tasks are Video Editing Technique Recognition, which tests identification of seven editing techniques via multimodal cues, and Video Editing Operation Simulation, which requires selecting and temporally localizing relevant clips from multiple candidates to model editing workflows. Experiments across proprietary and open-source models demonstrate a large gap to human-level editing cognition.
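
To make the two task formats concrete, here is a minimal sketch of how the QA items could be represented. The field names are hypothetical; the review does not reproduce the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schemas; field names are illustrative, not the paper's actual format.

@dataclass
class TechRecItem:
    """Video Editing Technique Recognition: identify the editing technique
    used at a specified position in one edited video."""
    video_id: str
    question: str          # e.g. "Which editing technique occurs at 0:42?"
    options: List[str]     # technique labels offered as choices
    answer: str            # human-verified technique label

@dataclass
class OpSimItem:
    """Video Editing Operation Simulation: pick the right candidate clip for a
    reference video, then localize the segment to use."""
    reference_video_id: str
    candidate_video_ids: List[str]   # multiple candidate clips
    answer_candidate: str            # which candidate a human editor chose
    answer_start_s: float            # ground-truth segment start (seconds)
    answer_end_s: float              # ground-truth segment end (seconds)
```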

What carries the argument

The VEBench benchmark with its two complementary QA tasks: technique recognition, which identifies seven editing methods from multimodal cues, and operation simulation, which requires selecting and temporally localizing clips from multiple candidate videos to model real editing workflows.
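
A plausible scoring scheme for the operation-simulation task combines candidate-selection accuracy with temporal IoU between the predicted and ground-truth segments. The sketch below assumes that scheme and a 0.5 IoU threshold purely for illustration; the paper's exact metric is not spelled out in this review.

```python
def temporal_iou(pred_start, pred_end, gt_start, gt_end):
    """Intersection-over-union of two time intervals, in seconds."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return inter / union if union > 0 else 0.0

def score_opsim(pred_candidate, pred_start, pred_end,
                gt_candidate, gt_start, gt_end, iou_threshold=0.5):
    """Assumed scoring: a prediction counts as fully correct only if the right
    candidate clip is chosen AND the localized segment overlaps the ground
    truth above the threshold. Returns (selection_correct, full_correct)."""
    selection_correct = pred_candidate == gt_candidate
    iou = temporal_iou(pred_start, pred_end, gt_start, gt_end)
    return selection_correct, selection_correct and iou >= iou_threshold

# Example: right clip chosen, segment slightly off but above the threshold.
print(score_opsim("clip_B", 12.0, 19.0, "clip_B", 13.5, 20.0))  # (True, True)
```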

Load-bearing premise

The three-round human-AI collaborative annotation pipeline produces QA pairs that accurately reflect real-world video editing demands without introducing systematic bias.

What would settle it

A model achieving or exceeding human performance levels across all 3,080 QA pairs on both the technique recognition and operation simulation tasks would disprove the claimed performance gap.
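
Stated operationally, this falsification condition is a simple per-task comparison against the human baseline. A minimal sketch, with illustrative accuracy numbers that are not taken from the paper:

```python
def gap_is_disproved(model_acc, human_acc, tasks=("techrec", "opsim")):
    """True only if the model matches or exceeds the human baseline on every
    task, i.e. the condition stated above for overturning the claimed gap.
    Accuracies are fractions of correct answers over the relevant QA pairs."""
    return all(model_acc[t] >= human_acc[t] for t in tasks)

# Illustrative numbers only (not results from the paper):
model_acc = {"techrec": 0.61, "opsim": 0.34}
human_acc = {"techrec": 0.92, "opsim": 0.88}
print(gap_is_disproved(model_acc, human_acc))  # False -> the claimed gap stands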

Figures

Figures reproduced from arXiv: 2605.03276 by Andong Deng, Chen Chen, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Longyin Wen, Sijie Zhu, Wen Zhong, Zhenfang Chen.

Figure 1: Examples of the two tasks in VEBENCH. Video Editing Technique Recognition (top) detects the editing technique at a specified position; Video Editing Operation Simulation (bottom) first selects the most suitable footage from a candidate list and then localizes the proper segment based on a reference video. Correct answers are indicated in green.
Figure 2: Video Editing Techniques Illustrations. The red arrows …
Figure 3: Annotation pipelines of the two VEBENCH subtasks. (a) TechRec: GPT-4o provides YouTube video search queries and Gemini-2.5-Pro filters video candidates; human annotators inspect the videos and finalize the editing technique labels. (b) OpSim: Gemini-2.5-Pro analyzes the selected high-quality edited videos and generates metadata; human annotators perform A/B-roll pairing and refine timestamps; human annotators …
Figure 4: Distributions of video duration and editing techniques in …
Figure 5: Examples of two different visual prompts in the stitched …
Figure 6: Qualitative example of Gemini-2.5-Pro [7], Qwen3-VL-8B-Instruct [28], and InternVL3-8B [42].
Figure 7: Predicted timestamp distributions in OpSim for Gemini …
Figure 8: The prompt (P1) used for video analysis in VEBENCH …
Figure 9: The prompt (P2) used for video analysis in VEBENCH …
Figure 10: Operation simulation metadata example.
Figure 11: Qualitative example of Gemini-2.5-Pro [7], Qwen3-VL-8B-Instruct [28], and InternVL3-8B [42].
Figure 12: Qualitative example of Gemini-2.5-Pro [7], Qwen3-VL-8B-Instruct [28], and InternVL3-8B [42].
Original abstract

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VEBENCH, a benchmark comprising 3.9K high-quality edited videos (257+ hours) and 3,080 human-verified QA pairs, constructed via a three-round human-AI collaborative annotation pipeline. It defines two tasks—Video Editing Technique Recognition (identifying 7 techniques via multimodal cues) and Video Editing Operation Simulation (selecting and temporally localizing clips from candidates)—and reports extensive experiments on proprietary and open-source LMMs showing a substantial performance gap relative to human-level editing cognition.

Significance. If the benchmark's annotations prove free of systematic bias, VEBENCH would be a valuable contribution as the first large-scale resource specifically targeting the intersection of video editing knowledge and operational multimodal reasoning. The scale, dual-task design, and human verification pipeline are clear strengths that could catalyze progress on creative video workflows beyond standard understanding benchmarks.

major comments (2)
  1. [Section 3] Section 3 (Dataset Construction and Annotation Pipeline): The three-round human-AI pipeline is presented as ensuring 'precise temporal labeling and semantic consistency,' yet the manuscript provides no inter-annotator agreement statistics, error-rate analysis, or ablation comparing AI-assisted vs. human-only annotations (a minimal agreement-computation sketch follows the minor comments below). This is load-bearing for the central claim of a genuine gap, as unquantified bias in technique labeling or clip selection could artifactually inflate model shortfalls.
  2. [Section 4.2] Section 4.2 (Video Editing Operation Simulation Results): The reported performance gap is stated without accompanying statistical significance tests, confidence intervals, or per-technique breakdowns that would confirm the gap is robust rather than driven by a subset of the 7 techniques or particular video categories.
minor comments (2)
  1. [Abstract and Section 3] The abstract and introduction refer to '3.9K high-quality edited videos'; the methods section should explicitly reconcile this count with the 3,080 QA pairs (e.g., average pairs per video) for clarity.
  2. [Figure 1] Figure 1 (benchmark overview) would benefit from explicit callouts to the two QA task formats and the 7 editing techniques to improve immediate readability.
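
Regarding major comment 1, one standard way to quantify inter-annotator agreement on a relabeled sample of QA pairs is Fleiss' kappa. The sketch below is a generic implementation, not the authors' procedure, and assumes each sampled item is labeled by the same number of annotators:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each item being the list of category
    labels assigned by the (equally many) annotators who rated it."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Per-item agreement P_i and overall category proportions p_j.
    p_i_sum = 0.0
    category_totals = Counter()
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        p_i_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = p_i_sum / n_items
    p_e = sum((total / (n_items * n_raters)) ** 2
              for total in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 QA pairs, 3 annotators each, labels drawn from the technique set.
ratings = [["cut", "cut", "cut"],
           ["cut", "dissolve", "cut"],
           ["dissolve", "dissolve", "dissolve"],
           ["wipe", "wipe", "dissolve"]]
print(round(fleiss_kappa(ratings), 3))  # moderate agreement on this toy sample
```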

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on VEBench. The comments highlight important aspects of annotation reliability and experimental rigor. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (Dataset Construction and Annotation Pipeline): The three-round human-AI pipeline is presented as ensuring 'precise temporal labeling and semantic consistency,' yet the manuscript provides no inter-annotator agreement statistics, error-rate analysis, or ablation comparing AI-assisted vs. human-only annotations. This is load-bearing for the central claim of a genuine gap, as unquantified bias in technique labeling or clip selection could artifactually inflate model shortfalls.

    Authors: We acknowledge that the current manuscript does not report inter-annotator agreement (IAA) statistics, a dedicated error-rate analysis, or an ablation of AI-assisted versus human-only annotation. The three-round pipeline relies on AI for initial proposals followed by multi-round human verification to enforce consistency, but quantitative validation of this process was omitted. In revision, we will add IAA metrics (e.g., Fleiss' kappa) computed on a sampled subset of the 3,080 QA pairs from the final human verification round. We will also include an error analysis summarizing the most frequent discrepancy types resolved by humans. A full ablation study comparing AI-only versus the full pipeline is not feasible within current resources, but we will expand the pipeline description with qualitative evidence of bias reduction. These changes will better support the annotation quality underlying the reported model-human gap. revision: partial

  2. Referee: [Section 4.2] Section 4.2 (Video Editing Operation Simulation Results): The reported performance gap is stated without accompanying statistical significance tests, confidence intervals, or per-technique breakdowns that would confirm the gap is robust rather than driven by a subset of the 7 techniques or particular video categories.

    Authors: We agree that the absence of statistical tests and breakdowns limits the strength of the claims in Section 4.2. In the revised manuscript, we will add bootstrap-based 95% confidence intervals and paired statistical significance tests (with multiple-comparison correction) for all model-versus-human comparisons on both tasks. We will also include per-technique performance tables and breakdowns by video category (e.g., duration, editing complexity) to demonstrate that the performance gap holds consistently across the seven techniques rather than being driven by outliers. These additions will be integrated into the results section and associated tables/figures. revision: yes
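
The statistical additions promised in this second response are straightforward to implement. Below is a minimal sketch of a paired bootstrap confidence interval for the human-minus-model accuracy gap over per-item correctness; the data shown are illustrative, not results from the paper:

```python
import random

def bootstrap_gap_ci(model_correct, human_correct, n_boot=10000, alpha=0.05, seed=0):
    """Paired bootstrap (1 - alpha) CI for (human accuracy - model accuracy).
    Both inputs are per-QA-pair 0/1 correctness lists of equal length;
    items are resampled jointly to preserve pairing."""
    assert len(model_correct) == len(human_correct)
    rng = random.Random(seed)
    n = len(model_correct)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        m = sum(model_correct[i] for i in idx) / n
        h = sum(human_correct[i] for i in idx) / n
        gaps.append(h - m)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy usage: if the interval excludes 0, the gap is unlikely to be resampling noise.
model = [1, 0, 0, 1, 0, 0, 1, 0] * 100
human = [1, 1, 1, 1, 0, 1, 1, 1] * 100
print(bootstrap_gap_ci(model, human, n_boot=2000))
```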

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent human-verified labels

Full rationale

The paper presents VEBENCH as a new dataset and evaluation suite constructed via a three-round human-AI annotation pipeline, followed by direct measurement of LMM performance on two QA tasks against those human-verified labels. No equations, fitted parameters, derivations, or self-referential definitions appear in the provided text. Reported gaps are observed performance differences, not quantities defined in terms of the models' own outputs or prior self-citations. The annotation pipeline is a methodological choice whose bias properties are external to the reported numbers; it does not reduce the benchmark results to tautology or self-definition. This is a standard empirical benchmark paper whose central claims rest on held-out measurements rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen 7 editing techniques and the human-verified QA pairs faithfully represent real-world video editing cognition; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The 7 editing techniques cover the key multimodal cues needed for realistic video editing assessment.
    Stated as the basis for the Video Editing Technique Recognition task in the abstract.
  • domain assumption: The three-round human-AI annotation pipeline produces temporally precise and semantically consistent labels.
    Invoked to justify the quality of the 3,080 QA pairs.

pith-pipeline@v0.9.0 · 5571 in / 1439 out tokens · 26038 ms · 2026-05-12T01:40:15.404352+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
• matches: The paper's claim is directly supported by a theorem in the formal canon.
• supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
• extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
• uses: The paper appears to rely on the theorem as machinery.
• contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
• unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 9 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

  2. [2]

    LongVILA: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. LongVILA: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024. 3

  3. [3]

    Video-Holmes: Can MLLM think like Holmes for complex video reasoning? CoRR, abs/2505.21374, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-Holmes: Can MLLM think like Holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025. 2

  4. [4]

    The technique of film and video editing: history, theory, and practice

    Ken Dancyger. The technique of film and video editing: history, theory, and practice. Routledge, 2018. 2, 3

  5. [5]

    Motion-grounded video reasoning: Understanding and perceiving motion at pixel level

    Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, and Chen Chen. Motion-grounded video reasoning: Understanding and perceiving motion at pixel level. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8625–8636, 2025. 2

  6. [6]

    Scivideobench: Benchmarking scientific video reasoning in large multimodal models

    Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, and Xiaohan Wang. Scivideobench: Benchmarking scientific video reasoning in large multimodal models. arXiv preprint arXiv:2510.08559, 2025. 3, 6

  7. [7]

    Gemini models: Gemini 2.5 Pro

    Google AI for Developers. Gemini models: Gemini 2.5 Pro. Accessed May 15, 2025. 2, 4, 6, 7, 8, 12, 17, 18, 19

  9. [9]

    Film and Video Editing Theory

    Michael Frierson. Film and Video Editing Theory. Routledge, 2018. 2, 10

  10. [10]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 3, 6

  11. [11]

    Vita-1.5: Towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards GPT-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025. 17, 18

  12. [12]

    Edit3k: Universal representation learning for video editing components. arXiv preprint arXiv:2403.16048, 2024

    Xin Gu, Libo Zhang, Fan Chen, Longyin Wen, Yufei Wang, Tiejian Luo, and Sijie Zhu. Edit3k: Universal representation learning for video editing components. arXiv preprint arXiv:2403.16048, 2024. 2, 3

  13. [13]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015. 2

  14. [14]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025. 3

  15. [15]

    B-script: Transcript-based B-roll video editing with recommendations

    Bernd Huber, Hijung Valentina Shin, Bryan Russell, Oliver Wang, and Gautham J Mysore. B-script: Transcript-based B-roll video editing with recommendations. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–11, 2019. 2

  16. [16]

    Shoot, edit, share: Video production for mass media, marketing, advertising, and public relations

    Kirsten Johnson and Jodi Radosh. Shoot, edit, share: Video production for mass media, marketing, advertising, and public relations. Routledge, 2016. 2

  17. [17]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950,

  18. [18]

    Veu-bench: Towards comprehensive understanding of video editing

    Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, and Wenbo Zhu. Veu-bench: Towards comprehensive understanding of video editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13671–13680, 2025. 2, 3

  19. [19]

    Omnivideobench: Towards audio-visual understanding evaluation for omni mllms, 2025

    Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, et al. Omnivideobench: Towards audio-visual understanding evaluation for omni MLLMs. arXiv preprint arXiv:2510.10689, 2025. 2

  20. [20]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024. 3

  21. [21]

    From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

    Jiangtong Li, Li Niu, and Liqing Zhang. From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

  22. [22]

    Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 3

  23. [23]

    Shotbench: Expert-level cinematic understanding in vision-language models

    Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, et al. Shotbench: Expert-level cinematic understanding in vision-language models. arXiv preprint arXiv:2506.21356, 2025. 2, 3

  24. [24]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 2, 3

  25. [25]

    Film language: A semiotics of the cinema

    Christian Metz. Film language: A semiotics of the cinema. University of Chicago Press, 1991. 2

  26. [26]

    GPT-4o: OpenAI's newest multimodal model

    OpenAI. GPT-4o: OpenAI's newest multimodal model. https://openai.com/index/gpt-4o, 2024. Accessed: 2025-05-09. 6, 7

  27. [27]

    A benchmark dataset and evaluation methodology for video object segmentation

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016. 2

  28. [28]

    Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. CoRR, abs/2504.07956, 2025

    Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025. 2

  29. [29]

    Qwen3-VL

    Qwen Team. Qwen3-VL. https://github.com/QwenLM/Qwen3-VL, 2025. 2, 3, 6, 7, 8, 17, 18, 19

  30. [30]

    The art of video production

    Leonard Shyles. The art of video production. Sage Publications, 2007. 2

  31. [31]

    Video-mmlu: A massive multi- discipline lecture understanding benchmark

    Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, and Gaoang Wang. Video-MMLU: A massive multi-discipline lecture understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6099–6113, 2025. 3

  32. [32]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 2

  33. [33]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. 7

  34. [34]

    Research on the application of nonlinear editing technology in vlog production

    Hui Wang. Research on the application of nonlinear editing technology in vlog production. In 2021 International Conference on Big Data Analysis and Computer Science (BDACS), pages 246–249. IEEE, 2021. 2

  35. [35]

    Cinetechbench: A benchmark for cinematographic technique understanding and generation. arXiv preprint arXiv:2505.15145, 2025

    Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, and Zhanyu Ma. Cinetechbench: A benchmark for cinematographic technique understanding and generation. arXiv preprint arXiv:2505.15145, 2025. 2

  36. [36]

    Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024. 2, 3, 6

  37. [37]

    Beyond raw videos: Understanding edited videos with large multimodal model

    Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, and Longyin Wen. Beyond raw videos: Understanding edited videos with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 503–512, 2025. 2, 3

  38. [38]

    Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018

    Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. 2

  39. [39]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 7

  40. [40]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 17, 18

  41. [41]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 3

  42. [42]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 2

  43. [43]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingchen...
