DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

Jialiang Chen; Jiayu Chen; Ke Li; Maoliang Li; Shaoqi Wang; Xiang Chen; Zihao Zheng

arXiv: 2604.04875 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI · cs.MM

Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, Xiang Chen

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM

keywords video mashup creation · multi-agent framework · video editing · multimodal coherency · hierarchical planning · intent-guided editing · benchmarking

The pith

The DIRECT framework solves video mashup creation by using hierarchical agents to ensure multimodal coherency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video mashup creation can be treated as a Multimodal Coherency Satisfaction Problem and solved through a three-stage agent pipeline that mimics professional film production. Current automated approaches produce sequences with abrupt visual cuts and poor music alignment because they lack coordinated planning across global structure, editing intent, and fine details. By assigning separate agents to source-aware anchoring, adaptive intent generation, and optimized shot sequencing, the method produces smoother results. This matters for anyone who wants to repurpose existing video clips into engaging content without manual editing labor.

Core claim

We formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment.
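
To make the three cascade levels concrete, the following is a minimal sketch of how a plan could flow from one level to the next. It is not the authors' implementation: the class names, function signatures, and the toy keyword matching are all assumptions made for illustration, and in the paper each level is an agent rather than a few lines of rule-based code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SectionPlan:
    """Global anchor for one music section (hypothetical Screenwriter output)."""
    name: str
    energy: str          # "low" or "high" in this toy example
    tags: List[str]


@dataclass
class Intent:
    """Per-section editing guidance (hypothetical Director output)."""
    query: str
    beats_per_shot: int


def screenwriter(sections: List[Dict[str, str]], themes: List[str]) -> List[SectionPlan]:
    # Level 1: anchor every music section to a theme found in the footage.
    return [
        SectionPlan(s["name"], s["energy"], [themes[i % len(themes)]])
        for i, s in enumerate(sections)
    ]


def director(plan: SectionPlan) -> Intent:
    # Level 2: turn the global plan into a retrieval query and a pacing choice.
    # Assumption: faster cuts for high-energy sections, slower ones otherwise.
    beats = 1 if plan.energy == "high" else 4
    return Intent(query=" ".join(plan.tags), beats_per_shot=beats)


def editor(intent: Intent, library: Dict[str, str]) -> List[str]:
    # Level 3: keep clips whose description contains every query word.
    words = intent.query.split()
    return [cid for cid, desc in library.items() if all(w in desc for w in words)]


def run_cascade(
    sections: List[Dict[str, str]], themes: List[str], library: Dict[str, str]
) -> List[Tuple[str, int, List[str]]]:
    timeline = []
    for plan in screenwriter(sections, themes):
        intent = director(plan)
        timeline.append((plan.name, intent.beats_per_shot, editor(intent, library)))
    return timeline
```

The point of the sketch is only the direction of information flow: each level consumes the output of the level above it and never revisits it.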

What carries the argument

The cascade of Screenwriter, Director, and Editor agents that progressively handle global structure, intent guidance, and fine-grained editing to satisfy multimodal coherency.

If this is right

The approach scores higher than the compared baselines on objective metrics for visual continuity and auditory alignment.
Generated mashups receive better ratings in human subjective evaluations than prior methods.
The introduced Mashup-Bench benchmark supplies standardized metrics and test cases for future video editing research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same role decomposition could be tested on related tasks such as automatic video summarization or trailer generation.
If the agents can operate with low latency, the pipeline might support interactive or real-time mashup tools.
The framework invites comparison with single-model end-to-end editing systems to isolate the benefit of explicit hierarchical intent passing.

Load-bearing premise

Decomposing the task into Screenwriter, Director, and Editor agents accurately simulates a professional production pipeline and achieves cross-level multimodal orchestration for coherent mashups.

What would settle it

A direct comparison experiment in which human raters score DIRECT mashups as equally or less coherent in visual transitions and audio alignment than the strongest baseline methods would falsify the performance claim.
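
One way such a comparison could be analysed is an exact sign test on pairwise preferences. The sketch below assumes each rater sees a DIRECT mashup and a baseline mashup for the same input and picks one, with ties dropped; the function name and that study design are assumptions for illustration, not the protocol used in the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test for paired preferences (ties already dropped)."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value with more wins than losses would support the performance claim, while an even split or a majority of losses would count against it.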

Figures

Figures reproduced from arXiv: 2604.04875 by Jialiang Chen, Jiayu Chen, Ke Li, Maoliang Li, Shaoqi Wang, Xiang Chen, Zihao Zheng.

**Figure 1.** Hierarchical constraints of multimodal coherency in video mashup. (1) Global Structural Alignment of narrative ...

**Figure 2.** Overview of DIRECT. We decompose video mashup creation into three collaborative modules: the Screenwriter ...

**Figure 3.** Visualization of the hierarchical planning workflow in DIRECT. The Screenwriter leverages multimodal source ...

**Figure 4.** Intent-Guided Shot Sequence Editing. The Editor ...

**Figure 5.** Qualitative Comparison of Low-Level Coherency. While baseline (top row) only ensures semantic relevance, our ...

**Figure 6.** Case study of Footage Summarization. It decon...

Original abstract

Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DIRECT introduces a three-agent hierarchy and a new Mashup-Bench for video mashups, but the real test is whether the reported gains hold under standard controls.

The paper's core move is to treat video mashup creation as a Multimodal Coherency Satisfaction Problem and split it across Screenwriter, Director, and Editor agents that handle global structure, intent, and fine-grained edits in sequence. They also release Mashup-Bench with metrics focused on visual continuity and audio alignment. That decomposition is straightforward and gives the work a clear organizing principle that prior editing pipelines often lack. Releasing code is helpful for anyone who wants to build on the benchmark or test the pipeline themselves. The claim of outperforming baselines on both objective scores and human ratings is the part that matters most for impact, and if the controls are tight it could serve as a useful reference point in AI video editing.

The main soft spot is the evaluation. The abstract states clear wins, yet without seeing the exact baseline implementations, dataset splits, or how the human study was blinded and powered, it is hard to know whether the improvements come from the hierarchy or from other factors like prompt engineering or metric tuning. The agent roles map nicely to production stages on paper, but it is not obvious how much the cascade actually reduces the cross-modal misalignment problems compared with a single well-tuned model.

This work is for researchers who build tools for automated or semi-automated video editing and for people tracking multi-agent methods in creative domains. Readers who care about benchmarks in multimedia will get immediate value from Mashup-Bench even if they skip the agent details. It deserves a serious referee because the benchmark and formulation are new enough to warrant external checks on the experiments and ablations. I would send it to review with a note to strengthen the experimental section.

Referee Report

0 major / 2 minor

Summary. The paper claims to address the challenge of automated video mashup creation by formulating it as a Multimodal Coherency Satisfaction Problem (MMCSP). It proposes the DIRECT framework, which uses a hierarchical multi-agent system consisting of a Screenwriter agent for source-aware global structural anchoring, a Director agent for adaptive editing intent and guidance, and an Editor agent for intent-guided shot sequence editing with fine-grained optimization. Additionally, the authors introduce Mashup-Bench, a benchmark with metrics for visual continuity and auditory alignment, and demonstrate through experiments that DIRECT outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation.

Significance. If the empirical results are robust, this work could significantly advance the field of automated video editing by providing a structured, multi-level approach to achieving multimodal coherence in mashups. The introduction of a specialized benchmark is a valuable contribution that may facilitate standardized evaluation in future research on video composition and editing.

minor comments (2)

[Abstract] Consider adding a brief mention of the specific objective metrics used (e.g., for visual continuity and auditory alignment) to strengthen the summary of results.
[Introduction or §3] Ensure that the description of the three agents clearly distinguishes their roles to avoid any potential overlap in responsibilities.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and significance assessment of our work, as well as the recommendation for minor revision. The referee's description accurately reflects the MMCSP formulation, the three-level DIRECT framework, and the Mashup-Bench contribution.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces a new formulation of video mashup creation as MMCSP and a hierarchical multi-agent framework (Screenwriter-Director-Editor) plus the Mashup-Bench benchmark, with performance claims resting on external comparisons to baselines and human evaluations. No equations, fitted parameters, or derivations are present that reduce to self-inputs by construction; the argument structure is a standard empirical proposal of a novel pipeline without load-bearing self-citations or renamings of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the domain assumption that video mashup creation is best solved by formulating it as MMCSP and decomposing it into the three named agent roles; no numerical free parameters are mentioned. The three agents are invented entities introduced to structure the pipeline.

axioms (1)

Domain assumption: Video mashup creation can be formulated as a Multimodal Coherency Satisfaction Problem (MMCSP) that requires cross-level orchestration across semantic, visual, and auditory dimensions.
This formulation is introduced in the abstract to justify the hierarchical multi-agent approach.

invented entities (3)

Screenwriter agent (no independent evidence)
purpose: Source-aware global structural anchoring
New agent role introduced to handle high-level planning in the framework.

Director agent (no independent evidence)
purpose: Instantiating adaptive editing intent and guidance
New agent role introduced to bridge planning and execution.

Editor agent (no independent evidence)
purpose: Intent-guided shot sequence editing with fine-grained optimization
New agent role introduced for low-level editing decisions.

pith-pipeline@v0.9.0 · 5508 in / 1473 out tokens · 67025 ms · 2026-05-10T19:41:14.846808+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 1 internal anchor

[1] Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, and Bernard Ghanem. 2024. Towards automated movie trailer generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7445–7454.

[2] Brandon Castellano. [n. d.]. PySceneDetect: Python and OpenCV-based scene cut/transition detection program and library. https://github.com/Breakthrough/PySceneDetect

[3] Boris Chen, Amir Ziai, Rebecca S Tucker, and Yuchen Xie. 2023. Match cutting: Finding cuts with smooth visual transitions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2115–2125.

[4] Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, and Yanru Zhang. 2025. ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing. arXiv preprint arXiv:2511.02505 (2025).

[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24185–24198.

[6-7] Zihan Ding, Xinyi Wang, Junlong Chen, Per Ola Kristensson, and Junxiao Shen. Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media. arXiv preprint arXiv:2509.16811 (2025).

[8] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15180–15190.

[9] HKUDS. 2025. VideoAgent: All-in-One Agentic Framework for Video Understanding, Editing, and Remaking. GitHub Repository. https://github.com/HKUDS/VideoAgent Accessed: 2026-02-05.

[10-11] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations.

[12] Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. 2025. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3599–3607.

[13] Taejun Kim and Juhan Nam. 2023. All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 1–5.

[14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.

[15] Dawon Lee, Jung Eun Yoo, Kyungmin Cho, Bumki Kim, Gyeonghun Im, and Junyong Noh. 2022. PopStage: The Generation of Stage Cross-Editing Video based on Spatio-Temporal Matching. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–13.

[16] Zicheng Liao, Yizhou Yu, Bingchen Gong, and Lechao Cheng. 2015. Audeosynth: music-driven video montage. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–10.

[17] Wu-Qin Liu, Min-Xuan Lin, Hai-Bin Huang, Chong-Yang Ma, Yu Song, Wei-Ming Dong, and Chang-Sheng Xu. 2023. Emotion-aware music driven movie montage. Journal of Computer Science and Technology 38, 3 (2023), 540–553.

[18] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. 2024. Videostudio: Generating consistent-content and multi-scene videos. In European Conference on Computer Vision. Springer, 468–485.

[19] Chen-Yi Lu, Md Mehrab Tanjim, Ishita Dasgupta, Somdeb Sarkhel, Gang Wu, Saayan Mitra, and Somali Chaterji. 2025. Skald: Learning-based shot assembly for coherent multi-shot video creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17859–17868.

[20] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293–304.

[21] Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. 2022. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM international conference on multimedia. 638–647.

[22] Walter Murch. 2001. In the Blink of an Eye: A Perspective on Film Editing (2nd ed.). Silman-James Press, Los Angeles.

[23] Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. 2021. Clip-it! language-guided video summarization. Advances in neural information processing systems 34 (2021), 13988–14000.

[24] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. 2020. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern recognition 106 (2020), 107404.

[25-26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[27-28] Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. Videorag: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549 (2025).

[29] Marcelo Sandoval-Castaneda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, and Fabian Caba Heilbron. 2025. EditDuet: A Multi-Agent System for Video Non-Linear Editing. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–11.

[30] C Spearman. 2010. The proof and measurement of association between two things. International Journal of Epidemiology 39, 5 (10 2010), 1137–1150. doi:10.1093/ije/dyq191

[31] Qwen Team. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025).

[32] Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision. Springer, 402–419.

[33] Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, and Raj Sodhi. 2024. Lave: Llm-powered agent assistance and language augmentation for video editing. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 699–714.

[34-35] Miao Wang, Guo-Wei Yang, Shi-Min Hu, Shing-Tung Yau, Ariel Shamir, et al. Write-a-video: computational video montage from themed text. ACM Trans. Graph. 38, 6 (2019), 177–1

[36] Wikipedia contributors. 2026. Mashup (video) - Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Mashup_(video) Accessed: 2026-02-05.

[37] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.

[38] Yu Xiong, Fabian Caba Heilbron, and Dahua Lin. 2022. Transcript to video: Efficient clip sequencing from texts. In Proceedings of the 30th ACM International Conference on Multimedia. 5407–5416.

[39] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).

[40] Guoxing Yang, Haoyu Lu, Zelong Sun, and Zhiwu Lu. 2023. Shot retrieval and assembly with text script for video montage generation. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. 298–306.

[41] Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, et al. 2024. Mora: Enabling generalist video generation via a multi-agent framework. arXiv preprint arXiv:2403.13248 (2024).

[42-43] Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, and Yali Wang. V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents. In Proceedings of the Computer Vision and Pattern Recognition Conference. 3195–3205.

[44] Sidan Zhu, Yutong Wang, Hongteng Xu, and Dixin Luo. 2025. Weakly-supervised movie trailer generation driven by multi-modal semantic consistency. In 34th International Joint Conference on Artificial Intelligence, IJCAI 2025. International Joint Conferences on Artificial Intelligence, 10234–10242.

[45] Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou. 2025. Paper2video: Automatic video generation from scientific papers. arXiv preprint arXiv:2510.05096 (2025).

[46] Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. 2024. Vlogger: Make your dream a vlog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8806–8817.

DIRECT: Supplementary Material
In this document, we provide additional information including:
• Details of metrics implementation, ...
[47] De-Specify: Strictly remove proper nouns, specific actors, brand names, or real-world locations.

[48] Generalize: Abstract specific entities into broad categories (e.g., use "vehicle" instead of "sports car").

[49] Visual Invariants: Identify the constant elements across Subject, Action, Background, and Shot Type.

[50] Vibe Extraction: Distill the environmental aesthetic and atmospheric tone into concise keywords.
# Output Format
- Visual Archetype: [The simplified common archetype description]
- Visual Vibe: [3 keywords]
E.2 Screenwriter (Summary Synthesis)
# Role
Footage Library Analyst: A high-level synthesizer for intelligent video editing systems, specialized in ci...

[51] Aesthetic Deduction: Infer the global visual style (e.g., Gritty/Handheld vs. Static/Composed) from the collective metadata.

[52] Keyword Distillation: Extract 10-15 high-impact cinematic keywords defining the library's visual DNA.

[53] Thematic Clustering: Categorize raw descriptions into 8-10 prominent "Cinematic Themes" suitable for highlight montages.

[54] Descriptive Precision: For each theme, provide a concise title and a two-sentence definition:
- Sentence 1: General subject and action (de-specified).
- Sentence 2: Defining visual features (vibe, shot type, environment).
# Target Output Format
Footage Analysis Report: [Title]
- Visual Style & Tone: [Overall Aesthetic Deduction]
- Global Keywords: [10-15 ...

[55] Global Narrative Flow: A high-level emotional and story arc.

[56] Detailed Section Plan: A synchronized JSON mapping for every music segment.
# Operational Guidelines

[57] Energy-Sync: Align visual intensity (Shot Type, Action) with musical energy (e.g., Intro=Low, Chorus=High).

[58] Multi-Dimensional Tagging: Each section must include Subject/Action, Atmosphere/Vibe, Shot Type, and Energy Level.

[59] Tag Generality: Use generic descriptors (e.g., "intense fighting") instead of specific scenarios to maximize retrieval success.

[60] Contrast & Diversity: Ensure visible variance in tags between adjacent sections to reflect musical transitions.

[61] Completeness: Generate exactly one JSON object for every music section provided in the input.
# Output Format
## Global Narrative Flow
[Describe the overall story arc, emotional build-up, and climax.]
## Detailed Section Plan (Strictly JSON)
[ { "section_name": "Section Title", "energy_level": "Low/Medium/High", "visual_tags": ["tag1", "tag2", "tag3", "ta...

[62] Vibe & Energy Alignment: Match the visual intensity and thematic tone to the current music section's energy level and keywords.

[63] CLIP Optimization: Use simple "Subject + Action/Description" structures. Avoid complex adjectives; focus on core, retrievable visual elements to ensure a wide search range.

[64] Diversity Protocol: Review the previous 4 segments' queries to ensure visual variety. Avoid repeating the same subjects or actions to maintain montage dynamism.

[65] Error Adaptation: If a "Prior Failure" is provided, analyze the feedback and further generalize the description or switch to a different subject within the same vibe to resolve the rejection.
# Output Format (Strictly JSON)
{ "thought_process": "1. Analyze vibe/energy. 2. Verify history to avoid repetition. 3. Simplify to a 2-7 word retrieval string.", "r...

[66] Kinetic Assessment: Evaluate the query's motion requirements (e.g., high-speed action vs. static portraiture) and cinematic complexity.

[67] Profile Mapping: Select exactly one profile from the predefined technical library:
- Semantic_Priority: For narrative focus or specific subjects.
- Motion_Continuity_Priority: For fluid action sequences (e.g., chases, racing).
- Composition_Similarity_Priority: For static framing or match-cutting (e.g., close-ups).
- Hybrid_Visual_Coherent: For intricate...

[68] Contextual Integration: Factor in the current music section's energy level to prioritize either temporal flow or visual detail.
# Output Format (Strictly JSON)
{ "thought_process": "Analysis of query kinetics, musical energy alignment, and profile selection logic.", "weight_profile": "Profile_Name" }
E.6 Director (Rhythmic Pacing)
# Role
Director Agent (R...

[69] Temporal Constraints: A reasonable total segment duration is within 4-16 beats and must strictly be ≤ beats_remaining.

[70] Musical Alignment:
- High Energy: Utilize short durations (1-2 beats) for rapid-fire editing.
- Low Energy: Utilize long durations (4, 8, 16 beats) for sustained shots.
- Triple Meter (3/4): Prioritize 3 or 6-beat cuts to maintain metric synchronicity.

[71] Adaptive Shot Density:
- High Density (Common/Action): For abundant footage (e.g., running, fighting), use frequent cuts to increase visual dynamism.
- Low Density (Specific/Emotional): For rare or narrative-heavy footage (e.g., crying, explosions), use fewer, longer shots to preserve detail.
- Medium Density (Atmospheric): For landscapes or establishing ...
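
The pacing guidelines quoted in entries [69] and [70] can be read as simple constraints on shot lengths. The sketch below encodes them under stated simplifications: the function names are hypothetical, triple meter is treated as an override even though the excerpt only says to prioritize 3- and 6-beat cuts, and only the "high" and "low" energy cases are handled because the excerpt gives no rule for other levels. In the paper these rules are instructions inside a Director prompt, not standalone code.

```python
from typing import List


def allowed_durations(energy: str, beats_remaining: int, triple_meter: bool = False) -> List[int]:
    """Candidate shot lengths in beats that still fit in the current section."""
    if triple_meter:
        candidates = [3, 6]        # 3/4 time: keep cuts on the metric grid
    elif energy == "high":
        candidates = [1, 2]        # rapid-fire editing
    elif energy == "low":
        candidates = [4, 8, 16]    # sustained shots
    else:
        # The quoted excerpt does not say what to do for other energy levels.
        raise ValueError(f"no pacing rule for energy level {energy!r}")
    return [d for d in candidates if d <= beats_remaining]


def valid_segment(shot_beats: List[int], beats_remaining: int) -> bool:
    """A segment totals 4-16 beats and never exceeds the beats left."""
    total = sum(shot_beats)
    return 4 <= total <= 16 and total <= beats_remaining
```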
[72] Semantic Alignment: Verify if the candidate matches the core subject and action of the retrieval query (apply reasonable tolerance for non-literal matches).

[73] Structural Integrity: Check for visual coherence, avoiding technical glitches, jarring transitions, or artifacts.

[74] Selection Pragmatism: Prioritize selection over rejection unless candidates are fundamentally unrelated to the query or aesthetically broken.

[75] Error Feedback: If rejecting all candidates, provide specific diagnostic reasons and actionable suggestions for query regeneration or subject switching.
# Output Format (Strict JSON)
{
"success": boolean, // False if all candidates fail critical criteria
"best_candidate": int | null, // 0-based index of the chosen clip
"verdict": "string", // Concise pr...
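
A consumer of the Editor's reply would need to check it before acting on it. The sketch below validates only the three fields visible in the output format quoted in entry [75]; the excerpt is truncated, so any further fields are left out rather than guessed. The function name is hypothetical, and requiring `best_candidate` to be null on failure is an assumption drawn from the `int | null` annotation, not something the excerpt states.

```python
import json
from typing import Optional, Tuple


def parse_editor_verdict(raw: str, num_candidates: int) -> Tuple[bool, Optional[int], str]:
    """Validate the visible fields of an Editor reply and return them."""
    data = json.loads(raw)
    success = data["success"]
    best = data["best_candidate"]
    verdict = data["verdict"]

    if not isinstance(success, bool):
        raise ValueError("success must be a boolean")
    if not isinstance(verdict, str):
        raise ValueError("verdict must be a string")

    if success:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_index = isinstance(best, int) and not isinstance(best, bool)
        if not is_index or not 0 <= best < num_candidates:
            raise ValueError("best_candidate must be a valid 0-based index")
    elif best is not None:
        # Assumption: a failed selection carries no chosen clip.
        raise ValueError("best_candidate must be null when success is false")

    return success, best, verdict
```

Rejecting malformed replies early keeps a bad index from reaching the timeline assembly step.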

[1] [1]

Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, and Bernard Ghanem. 2024. Towards automated movie trailer generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7445–7454

work page 2024

[2] [2]

Brandon Castellano. [n. d.]. PySceneDetect: Python and OpenCV-based scene cut/transition detection program and library. https://github.com/Breakthrough/ PySceneDetect

work page

[3] [3]

Boris Chen, Amir Ziai, Rebecca S Tucker, and Yuchen Xie. 2023. Match cutting: Finding cuts with smooth visual transitions. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 2115–2125

work page 2023

[4] [4]

Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang, and Yanru Zhang. 2025. ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing.arXiv preprint arXiv:2511.02505(2025)

work page arXiv 2025

[5] [5]

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24185–24198

work page 2024

[6] [6]

Zihan Ding, Xinyi Wang, Junlong Chen, Per Ola Kristensson, and Junxiao Shen

work page

[7] [7]

Prompt-Driven Agentic Video Editing System: Autonomous Compre- hension of Long-Form, Story-Driven Media.arXiv preprint arXiv:2509.16811 (2025)

work page arXiv 2025

[8] [8]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15180–15190

work page 2023

[9] [9]

HKUDS. 2025. VideoAgent: All-in-One Agentic Framework for Video Under- standing, Editing, and Remaking. GitHub Repository. https://github.com/ HKUDS/VideoAgent Accessed: 2026-02-05

work page 2025

[10] [10]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al

work page

[11] [11]

InThe twelfth international conference on learning representations

MetaGPT: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations

work page

[12]

Hang Hua, Yunlong Tang, Chenliang Xu, and Jiebo Luo. 2025. V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3599–3607.

[13]

Taejun Kim and Juhan Nam. 2023. All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 1–5.

[14]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.

[15]

Dawon Lee, Jung Eun Yoo, Kyungmin Cho, Bumki Kim, Gyeonghun Im, and Junyong Noh. 2022. PopStage: The Generation of Stage Cross-Editing Video based on Spatio-Temporal Matching. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–13.

[16]

Zicheng Liao, Yizhou Yu, Bingchen Gong, and Lechao Cheng. 2015. Audeosynth: music-driven video montage. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–10.

[17]

Wu-Qin Liu, Min-Xuan Lin, Hai-Bin Huang, Chong-Yang Ma, Yu Song, Wei-Ming Dong, and Chang-Sheng Xu. 2023. Emotion-aware music driven movie montage. Journal of Computer Science and Technology 38, 3 (2023), 540–553.

[18]

Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. 2024. Videostudio: Generating consistent-content and multi-scene videos. In European Conference on Computer Vision. Springer, 468–485.

[19]

Chen-Yi Lu, Md Mehrab Tanjim, Ishita Dasgupta, Somdeb Sarkhel, Gang Wu, Saayan Mitra, and Somali Chaterji. 2025. Skald: Learning-based shot assembly for coherent multi-shot video creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17859–17868.

[20]

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293–304.

[21]

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. 2022. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM international conference on multimedia. 638–647.

[22]

Walter Murch. 2001. In the Blink of an Eye: A Perspective on Film Editing (2nd ed.). Silman-James Press, Los Angeles.

[23]

Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. 2021. Clip-it! language-guided video summarization. Advances in neural information processing systems 34 (2021), 13988–14000.

[24]

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. 2020. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern recognition 106 (2020), 107404.

[25]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[27]

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. 2025. Videorag: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549 (2025).

[29]

Marcelo Sandoval-Castaneda, Bryan Russell, Josef Sivic, Gregory Shakhnarovich, and Fabian Caba Heilbron. 2025. EditDuet: A Multi-Agent System for Video Non-Linear Editing. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–11.

[30]

C Spearman. 2010. The proof and measurement of association between two things. International Journal of Epidemiology 39, 5 (10 2010), 1137–1150. doi:10.1093/ije/dyq191

[31]

Qwen Team. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025).

[32]

Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision. Springer, 402–419.

[33]

Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, and Raj Sodhi. 2024. Lave: Llm-powered agent assistance and language augmentation for video editing. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 699–714.

[34]

Miao Wang, Guo-Wei Yang, Shi-Min Hu, Shing-Tung Yau, Ariel Shamir, et al. 2019. Write-a-video: computational video montage from themed text. ACM Trans. Graph. 38, 6 (2019), 177–1.

[36]

Wikipedia contributors. 2026. Mashup (video) — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Mashup_(video) Accessed: 2026-02-05.

[37]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.

[38]

Yu Xiong, Fabian Caba Heilbron, and Dahua Lin. 2022. Transcript to video: Efficient clip sequencing from texts. In Proceedings of the 30th ACM International Conference on Multimedia. 5407–5416.

[39]

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084 (2021).

[40]

Guoxing Yang, Haoyu Lu, Zelong Sun, and Zhiwu Lu. 2023. Shot retrieval and assembly with text script for video montage generation. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. 298–306.

[41]

Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, et al. 2024. Mora: Enabling generalist video generation via a multi-agent framework. arXiv preprint arXiv:2403.13248 (2024).

[42]

Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, and Yali Wang. V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents. In Proceedings of the Computer Vision and Pattern Recognition Conference. 3195–3205.

[44]

Sidan Zhu, Yutong Wang, Hongteng Xu, and Dixin Luo. 2025. Weakly-supervised movie trailer generation driven by multi-modal semantic consistency. In 34th International Joint Conference on Artificial Intelligence, IJCAI 2025. International Joint Conferences on Artificial Intelligence, 10234–10242.

[45]

Zeyu Zhu, Kevin Qinghong Lin, and Mike Zheng Shou. 2025. Paper2video: Automatic video generation from scientific papers. arXiv preprint arXiv:2510.05096 (2025).

[46]

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. 2024. Vlogger: Make your dream a vlog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8806–8817.

DIRECT: Supplementary Material

In this document, we provide additional information including:

• Details of metrics implementation, ...

De-Specify: Strictly remove proper nouns, specific actors, brand names, or real-world locations.

Generalize: Abstract specific entities into broad categories (e.g., use "vehicle" instead of "sports car").

Visual Invariants: Identify the constant elements across Subject, Action, Background, and Shot Type.

Vibe Extraction: Distill the environmental aesthetic and atmospheric tone into concise keywords.

# Output Format
- Visual Archetype: [The simplified common archetype description]
- Visual Vibe: [3 keywords]

E.2 Screenwriter (Summary Synthesis)

# Role
Footage Library Analyst: A high-level synthesizer for intelligent video editing systems, specialized in ci...

Aesthetic Deduction: Infer the global visual style (e.g., Gritty/Handheld vs. Static/Composed) from the collective metadata.

Keyword Distillation: Extract 10-15 high-impact cinematic keywords defining the library's visual DNA.

Thematic Clustering: Categorize raw descriptions into 8-10 prominent "Cinematic Themes" suitable for highlight montages.

Descriptive Precision: For each theme, provide a concise title and a two-sentence definition:
- Sentence 1: General subject and action (de-specified).
- Sentence 2: Defining visual features (vibe, shot type, environment).

# Target Output Format
Footage Analysis Report: [Title]
- Visual Style & Tone: [Overall Aesthetic Deduction]
- Global Keywords: [10-15 ...

Global Narrative Flow: A high-level emotional and story arc.

Detailed Section Plan: A synchronized JSON mapping for every music segment.

# Operational Guidelines

Energy-Sync: Align visual intensity (Shot Type, Action) with musical energy (e.g., Intro=Low, Chorus=High).

Multi-Dimensional Tagging: Each section must include Subject/Action, Atmosphere/Vibe, Shot Type, and Energy Level.

Tag Generality: Use generic descriptors (e.g., "intense fighting") instead of specific scenarios to maximize retrieval success.

Contrast & Diversity: Ensure visible variance in tags between adjacent sections to reflect musical transitions.

Completeness: Generate exactly one JSON object for every music section provided in the input.

# Output Format

## Global Narrative Flow
[Describe the overall story arc, emotional build-up, and climax.]

## Detailed Section Plan (Strictly JSON)
[ { "section_name": "Section Title", "energy_level": "Low/Medium/High", "visual_tags": ["tag1", "tag2", "tag3", "ta...
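The Completeness and Contrast & Diversity guidelines above can be checked mechanically once the plan is returned. The following is a minimal sketch, not part of the DIRECT implementation: the function name `validate_section_plan` is hypothetical, only the three fields visible in the truncated output format (`section_name`, `energy_level`, `visual_tags`) are inspected, and treating identical tag sets as a diversity violation is an assumption.

```python
import json

# Assumption: "Low/Medium/High" in the prompt means exactly one of these values.
ALLOWED_ENERGY = {"Low", "Medium", "High"}

def validate_section_plan(plan_json, music_sections):
    """Return a list of problems found in a section plan (empty if none)."""
    plan = json.loads(plan_json)
    problems = []
    names = [entry.get("section_name") for entry in plan]
    # Completeness: exactly one JSON object per music section.
    for section in music_sections:
        count = names.count(section)
        if count != 1:
            problems.append(f"{section}: expected 1 entry, found {count}")
    # Field checks on the keys visible in the prompt's output format.
    for entry in plan:
        name = entry.get("section_name")
        if entry.get("energy_level") not in ALLOWED_ENERGY:
            problems.append(f"{name}: invalid energy_level")
        if not entry.get("visual_tags"):
            problems.append(f"{name}: missing visual_tags")
    # Contrast & Diversity (assumed test): adjacent sections must not
    # carry exactly the same set of tags.
    for prev, cur in zip(plan, plan[1:]):
        tags = set(cur.get("visual_tags") or [])
        if tags and tags == set(prev.get("visual_tags") or []):
            problems.append(f"{cur.get('section_name')}: same tags as previous section")
    return problems
```

An empty return value means the plan passed these checks; a non-empty list could be fed back to the planning agent as a regeneration hint.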


Vibe & Energy Alignment: Match the visual intensity and thematic tone to the current music section's energy level and keywords.

CLIP Optimization: Use simple "Subject + Action/Description" structures. Avoid complex adjectives; focus on core, retrievable visual elements to ensure a wide search range.

Diversity Protocol: Review the previous 4 segments' queries to ensure visual variety. Avoid repeating the same subjects or actions to maintain montage dynamism.

Error Adaptation: If a "Prior Failure" is provided, analyze the feedback and further generalize the description or switch to a different subject within the same vibe to resolve the rejection.

# Output Format (Strictly JSON)
{ "thought_process": "1. Analyze vibe/energy. 2. Verify history to avoid repetition. 3. Simplify to a 2-7 word retrieval string.", "r...

Kinetic Assessment: Evaluate the query's motion requirements (e.g., high-speed action vs. static portraiture) and cinematic complexity.

Profile Mapping: Select exactly one profile from the predefined technical library:
- Semantic_Priority: For narrative focus or specific subjects.
- Motion_Continuity_Priority: For fluid action sequences (e.g., chases, racing).
- Composition_Similarity_Priority: For static framing or match-cutting (e.g., close-ups).
- Hybrid_Visual_Coherent: For intricate...

Contextual Integration: Factor in the current music section's energy level to prioritize either temporal flow or visual detail.

# Output Format (Strictly JSON)
{ "thought_process": "Analysis of query kinetics, musical energy alignment, and profile selection logic.", "weight_profile": "Profile_Name" }

E.6 Director (Rhythmic Pacing)

# Role
Director Agent (R...

Temporal Constraints: A reasonable total segment duration is within 4-16 beats and must strictly $\le$ beats_remaining.

Musical Alignment:
- High Energy: Utilize short durations (1-2 beats) for rapid-fire editing.
- Low Energy: Utilize long durations (4, 8, 16 beats) for sustained shots.
- Triple Meter (3/4): Prioritize 3 or 6-beat cuts to maintain metric synchronicity.

Adaptive Shot Density:
- High Density (Common/Action): For abundant footage (e.g., running, fighting), use frequent cuts to increase visual dynamism.
- Low Density (Specific/Emotional): For rare or narrative-heavy footage (e.g., crying, explosions), use fewer, longer shots to preserve detail.
- Medium Density (Atmospheric): For landscapes or establishing ...
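The Temporal Constraints and Musical Alignment rules of the pacing prompt reduce to a small lookup plus a clamp. The sketch below is illustrative only: both function names are hypothetical, letting triple meter override the energy rule is an assumption (the prompt only says to "prioritize" 3- or 6-beat cuts), and the Medium-energy fallback is invented because the prompt states no rule for it.

```python
def shot_duration_options(energy_level, beats_per_bar=4):
    """Candidate shot lengths in beats, following the Musical Alignment rules."""
    if beats_per_bar == 3:
        return [3, 6]        # triple meter: 3- or 6-beat cuts (assumed to override energy)
    if energy_level == "High":
        return [1, 2]        # short durations for rapid-fire editing
    if energy_level == "Low":
        return [4, 8, 16]    # long durations for sustained shots
    return [2, 4]            # assumption: no rule is given for Medium energy

def clamp_segment_beats(requested, beats_remaining, lo=4, hi=16):
    """Keep a segment within 4-16 beats and never above beats_remaining."""
    bounded = max(lo, min(hi, requested))
    # The remaining-beats limit wins when fewer than `lo` beats are left.
    return min(bounded, beats_remaining)
```

For example, `clamp_segment_beats(20, 32)` yields a 16-beat segment, while `clamp_segment_beats(8, 6)` yields 6 because only six beats remain in the section.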


Semantic Alignment: Verify if the candidate matches the core subject and action of the retrieval query (apply reasonable tolerance for non-literal matches).

Structural Integrity: Check for visual coherence, avoiding technical glitches, jarring transitions, or artifacts.

Selection Pragmatism: Prioritize selection over rejection unless candidates are fundamentally unrelated to the query or aesthetically broken.

Error Feedback: If rejecting all candidates, provide specific diagnostic reasons and actionable suggestions for query regeneration or subject switching.

# Output Format (Strict JSON)
{ "success": boolean, // False if all candidates fail critical criteria
"best_candidate": int | null, // 0-based index of the chosen clip
"verdict": "string", // Concise pr...
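A caller consuming this reply would typically validate it before acting on the selection. The sketch below is a hedged illustration rather than the authors' code: `parse_judge_reply` is a hypothetical helper, it assumes the model returns plain JSON without the `//` annotations shown in the format, it checks only the three fields visible before the truncation, and requiring `best_candidate` to be null on rejection is an assumption.

```python
import json

def parse_judge_reply(raw, num_candidates):
    """Parse and sanity-check a candidate-selection reply; raise ValueError if malformed."""
    reply = json.loads(raw)
    success = reply.get("success")
    best = reply.get("best_candidate")
    if not isinstance(success, bool):
        raise ValueError("success must be a boolean")
    if not isinstance(reply.get("verdict"), str):
        raise ValueError("verdict must be a string")
    if success:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_index = isinstance(best, int) and not isinstance(best, bool)
        if not is_index or not 0 <= best < num_candidates:
            raise ValueError("best_candidate must be a valid 0-based index")
    elif best is not None:
        # Assumption: a rejection carries no chosen clip.
        raise ValueError("best_candidate must be null when success is false")
    return reply
```

Raising on malformed output keeps the failure visible, so the calling agent can retry or regenerate the query instead of silently editing with an invalid clip index.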
