A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation
Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3
The pith
CineAgents reformulates cinematic video compilation as a design-and-compose process to achieve superior narrative and logical coherence from user instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CineAgents significantly outperforms existing methods, generating compilations with superior narrative and logical coherence. It achieves this by reformulating the task into a design-and-compose paradigm: script reverse-engineering builds a hierarchical narrative memory that supplies multi-level context, and an iterative narrative planning process refines a creative blueprint into the final compiled script.
What carries the argument
The design-and-compose paradigm in the CineAgents multi-agent system, which uses script reverse-engineering for hierarchical narrative memory and iterative planning to produce the compiled script.
If this is right
- Cinematic video compilation can move beyond predefined tasks to handle a wide range of user instructions with improved coherence.
- Future systems can be evaluated against a shared benchmark with professional ground-truth compilations.
- Automatic adaptation of long-form content yields short videos with stronger narrative flow and logical structure.
- Multi-level context from hierarchical memory reduces fragmentation in compiled results.
Where Pith is reading between the lines
- The hierarchical memory construction might transfer to automated summarization of other time-based media such as lectures or sports broadcasts.
- Iterative planning in a multi-agent setup could support hybrid human-AI video editing pipelines where editors refine the blueprint.
- If the benchmark expands, it might reveal whether similar coherence gains appear in non-cinematic domains like documentary or tutorial compilation.
Load-bearing premise
Annotations by professional editors in CineBench provide an objective and comprehensive ground truth for evaluating cinematic compilation quality across diverse instructions.
What would settle it
Independent human raters scoring CineAgents outputs as equal or inferior in narrative and logical coherence to existing methods on the same CineBench instructions would falsify the superiority claim.
Original abstract
The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into a "design-and-compose" paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory that provides multi-level context, and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative and logical coherence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CineBench, presented as the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. It also presents CineAgents, a multi-agent system that reformulates the task into a 'design-and-compose' paradigm: script reverse-engineering builds a hierarchical narrative memory, and iterative narrative planning refines a creative blueprint into a compiled script. Extensive experiments are claimed to show that CineAgents significantly outperforms existing methods in narrative coherence and logical coherence.
Significance. If the empirical claims hold, this work supplies a needed benchmark resource and a structured multi-agent method for handling narrative and temporal issues in video compilation, which could support future research on adapting long-form cinematic content. The creation of a professionally annotated benchmark and the explicit 'design-and-compose' reformulation are concrete contributions that enable falsifiable comparisons.
major comments (2)
- [CineBench] CineBench section: the superiority claims rest on professional-editor annotations serving as objective ground truth for narrative and logical coherence across diverse instructions, yet no inter-annotator agreement statistics, coverage analysis of instruction types (e.g., non-linear narratives or cross-shot constraints), or sensitivity tests are reported; this is load-bearing for validating that higher scores reflect genuine coherence gains rather than style imitation.
- [Experiments] Experiments section: the abstract and results assert that CineAgents 'significantly outperforms existing methods' with 'superior narrative coherence and logical coherence,' but no concrete metrics, baseline implementations, coherence scoring protocol, or statistical tests are detailed in the provided description, preventing verification of the central empirical claim.
minor comments (2)
- [Introduction] The 'design-and-compose' paradigm is referenced repeatedly without a concise formal definition or explicit contrast to prior multi-agent decomposition strategies.
- [Figure 2] Figure captions for the system architecture would be clearer if they explicitly labeled the hierarchical memory construction and iterative refinement loops.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and will incorporate revisions to strengthen the validation of CineBench and the transparency of our experimental reporting.
Point-by-point responses
- Referee: [CineBench] CineBench section: the superiority claims rest on professional-editor annotations serving as objective ground truth for narrative and logical coherence across diverse instructions, yet no inter-annotator agreement statistics, coverage analysis of instruction types (e.g., non-linear narratives or cross-shot constraints), or sensitivity tests are reported; this is load-bearing for validating that higher scores reflect genuine coherence gains rather than style imitation.
Authors: We agree that inter-annotator agreement statistics are essential to substantiate the reliability of the professional-editor annotations as ground truth. In the revised manuscript, we will report inter-annotator agreement metrics (e.g., Fleiss' kappa) computed across the annotators for both narrative and logical coherence scores. We will also add a coverage analysis subsection detailing the distribution of instruction types in CineBench, explicitly quantifying the inclusion of non-linear narratives and cross-shot constraints. Sensitivity tests will be included via an analysis of how minor variations in annotations impact benchmark outcomes. These elements will be added to the CineBench section to directly address the concern. revision: yes
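For concreteness, the agreement statistic the authors propose to report can be sketched as below. This is a minimal, self-contained implementation of Fleiss' kappa over a rating matrix; the matrix and category counts are hypothetical illustrations, not CineBench data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix where ratings[i][j] counts the raters
    who assigned item i to category j (every row sums to the same n)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: mean over items of the pairwise rater agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement: squared marginal proportions of each category.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical coherence ratings: 4 items, 3 raters, 3 categories.
matrix = [[3, 0, 0], [0, 3, 0], [1, 2, 0], [0, 0, 3]]
kappa = fleiss_kappa(matrix)  # 35/47, "substantial" on the Landis-Koch scale
```

A value in the 0.6–0.8 range would typically be read as substantial agreement under the Landis–Koch interpretation cited in the paper's reference list.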
- Referee: [Experiments] Experiments section: the abstract and results assert that CineAgents 'significantly outperforms existing methods' with 'superior narrative coherence and logical coherence,' but no concrete metrics, baseline implementations, coherence scoring protocol, or statistical tests are detailed in the provided description, preventing verification of the central empirical claim.
Authors: The full experiments section provides concrete metrics (narrative coherence and logical coherence scores), descriptions of baseline implementations, and the coherence scoring protocol used by human evaluators. However, to enhance verifiability as noted, we will expand the section with a more explicit step-by-step description of the scoring protocol, additional implementation details for all baselines, and statistical significance tests (including p-values from appropriate tests such as paired t-tests). These clarifications and expansions will be integrated into the Experiments section and associated tables. revision: yes
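The proposed significance testing could look like the following sketch: a paired t-test over per-instruction coherence scores, with both systems scored on the same CineBench instructions. The score arrays below are hypothetical placeholders, not the paper's actual results.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same items, two systems)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-instruction coherence scores for two systems.
system_a = [4.1, 3.8, 4.5, 4.0, 4.2, 3.9]
system_b = [3.6, 3.5, 4.0, 3.8, 3.7, 3.4]
t = paired_t_statistic(system_a, system_b)
# Compare |t| to the two-tailed critical value for n - 1 = 5 degrees of
# freedom (2.571 at alpha = 0.05) to decide significance.
significant = abs(t) > 2.571
```

The pairing matters: because both systems are scored on identical instructions, testing the per-item differences removes instruction-level variance that an unpaired test would conflate with system differences.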
Circularity Check
No circularity in derivation chain; empirical evaluation on new benchmark
Full rationale
The paper introduces CineBench as a new benchmark with professional-editor annotations and presents CineAgents as a multi-agent system using standard techniques (script reverse-engineering, hierarchical memory, iterative planning). The central claim of outperformance is an empirical comparison on this benchmark, with no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its inputs by construction. No self-definitional steps, uniqueness theorems from prior author work, or ansatzes smuggled via citation are present in the provided text. The derivation is self-contained as a standard benchmark-plus-method construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Claude-3.7-Sonnet. https://claude.ai/. Accessed February 25, 2025
- [2] InsightFace. https://github.com/deepinsight/insightface/
- [3] Tencent Video. https://v.qq.com/
- [4] TikTok. https://www.tiktok.com/
- [5] YouTube Shorts. https://www.youtube.com/shorts/
- [6] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [7] Argaw, D.M., Soldan, M., Pardo, A., Zhao, C., Heilbron, F.C., Chung, J.S., Ghanem, B.: Towards automated movie trailer generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7445–7454 (2024)
- [8] Bain, M., Huh, J., Han, T., Zisserman, A.: WhisperX: Time-accurate speech transcription of long-form audio. arXiv preprint arXiv:2303.00747 (2023)
- [9] Barua, A., Benharrak, K., Chen, M., Huh, M., Pavel, A.: Lotus: Creating short videos from long videos with abstractive and extractive summarization. arXiv preprint arXiv:2502.07096 (2025)
- [10] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: NeurIPS (2020)
- [11] Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: ShareGPT4V: Improving large multi-modal models with better captions. In: ECCV (2024)
- [12] Chen, W., Xu, X., Jia, J., Luo, H., Wang, Y., Wang, F., Jin, R., Sun, X.: Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks. In: CVPR (2023)
- [13] Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., Zhang, Y., Lv, W., Huang, K., Zhang, Y., Zhang, J., Zhang, J., Liu, Y., Yu, D., Ma, Y.: PaddleOCR 3.0 technical report (2025). https://arxiv.org/abs/2507.05595
- [14] Gan, B., Shu, X., Qiao, R., Wu, H., Chen, K., Li, H., Ren, B.: Collaborative noisy label cleaner: Learning scene-aware trailers for multi-modal highlight detection in movies. In: CVPR (2023)
- [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
- [16] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S.K.S., Lin, Z., et al.: MetaGPT: Meta programming for a multi-agent collaborative framework. In: ICLR (2023)
- [17] Hu, P., Xiao, N., Li, F., Chen, Y., Huang, R.: A reinforcement learning-based automatic video editing method using pre-trained vision-language model. In: ACMMM (2023)
- [18] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: CVPR (2024)
- [19] Huber, B., Shin, H.V., Russell, B., Wang, O., Mysore, G.J.: B-Script: Transcript-based B-roll video editing with recommendations. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–11 (2019)
- [20] Huh, M., Yang, S., Peng, Y.H., Chen, X., Kim, Y.H., Pavel, A.: AVscript: Accessible video editing with audio-visual scripts. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (2023)
- [21] Koorathota, S., Adelman, P., Cotton, K., Sajda, P.: Editing like humans: A contextual, multimodal framework for automated video editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1701–1709 (2021)
- [22] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics (1977)
- [23] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)
- [24] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
- [25] Lu, J., Xiao, M., Wang, W., Du, Y., Wu, Z., Hua, C.: Multi-modal and metadata capture model for micro video popularity prediction. arXiv preprint arXiv:2502.17038 (2025)
- [26] Mun, J., Shin, M., Han, G., Lee, S., Ha, S., Lee, J., Kim, E.S.: Boundary-aware self-supervised learning for video scene segmentation. arXiv preprint arXiv:2201.05277 (2022)
- [27] Pardo, A., Caba, F., Alcázar, J.L., Thabet, A.K., Ghanem, B.: Learning to cut by watching movies. In: ICCV (2021)
- [28] Pardo, A., Wang, J.H., Ghanem, B., Sivic, J., Russell, B., Heilbron, F.C.: Generative timelines for instructed visual assembly. arXiv preprint arXiv:2411.12293 (2024)
- [29] Pavel, A., Reyes, G., Bigham, J.P.: Rescribe: Authoring and automatically editing audio descriptions. In: Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 747–759 (2020)
- [30] Phung, Q., Mai, L., Heilbron, F.D.C., Liu, F., Huang, J.B., Ham, C.: CineVerse: Consistent keyframe synthesis for cinematic scene composition. arXiv preprint arXiv:2504.19894 (2025)
- [31] Podlesnyy, S.: Towards data-driven automatic video editing. In: Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery: Volume 1, pp. 361–368. Springer (2020)
- [32] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: ICML (2023)
- [33] Sandoval-Castaneda, M., Russell, B., Sivic, J., Shakhnarovich, G., Caba Heilbron, F.: EditDuet: A multi-agent system for video non-linear editing. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference (2025)
- [34] Soe, T.H., Slavkovik, M.: AI video editing tools: What editors want and how far is AI from delivering them (2021)
- [35] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [36] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [37] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
- [38] Truong, A., Berthouzoz, F., Li, W., Agrawala, M.: QuickCut: An interactive tool for editing narrated video. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 497–507 (2016)
- [39] Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. Journal of the ACM (1974)
- [40] Wang, B., Li, Y., Lv, Z., Xia, H., Xu, Y., Sodhi, R.: LAVE: LLM-powered agent assistance and language augmentation for video editing. In: Proceedings of the 29th International Conference on Intelligent User Interfaces, pp. 699–714 (2024)
- [41] Wang, H., Liang, C., Wang, S., Chen, Z., Zhang, B., Xiang, X., Deng, Y., Qian, Y.: WeSpeaker: A research and production oriented speaker embedding learning toolkit. In: ICASSP (2023)
- [42] Wang, L., Liu, D., Puri, R., Metaxas, D.N.: Learning trailer moments in full-length movies with co-contrastive attention. In: ECCV (2020)
- [43] Wang, M., Yang, G.W., Hu, S.M., Yau, S.T., Shamir, A., et al.: Write-A-Video: Computational video montage from themed text. ACM Trans. Graph. 38(6) (2019)
- [44] Wang, X., Li, X., Wei, Y., Song, X., Song, Y., Xia, X., Zeng, F., Chen, Z., Liu, L., Xu, G., et al.: From long videos to engaging clips: A human-inspired video editing framework with multimodal narrative understanding. arXiv preprint arXiv:2507.02790 (2025)
- [45] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS (2022)
- [46] Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent CoT planning. arXiv preprint arXiv:2503.07314 (2025)
- [47] Xiong, Y., Heilbron, F.C., Lin, D.: Transcript to video: Efficient clip sequencing from texts. In: ACMMM (2022)
- [48] Xu, Z., Wang, J., Wang, L., Li, Z., Shi, S., Hu, B., Zhang, M.: FilmAgent: Automating virtual film production through a multi-agent collaborative framework. In: SIGGRAPH Asia (2024)
- [49] Zhang, P., Jia, Z., Liu, K., Weng, S., Li, S., Shi, B.: STAGE: Storyboard-anchored generation for cinematic multi-shot narrative. arXiv preprint arXiv:2512.12372 (2025)
- [50] Zhang, P., Weng, S., Tang, J., Li, S., Shi, B.: Towards deeper emotional reflection: Crafting affective image filters with generative priors. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
- [51] Zhang, Y., Sun, R., Chen, Y., Pfister, T., Zhang, R., Arik, S.: Chain of agents: Large language models collaborating on long-context tasks. In: NeurIPS (2024)
- [52] Zheng, M., Xu, Y., Huang, H., Ma, X., Liu, Y., Shu, W., Pang, Y., Tang, F., Chen, Q., Yang, H., Sernam, L.: VideoGen-of-Thought: A collaborative framework for multi-shot video generation. In: NeurIPS NextVid Workshop (Oral) (2025)
- [53] Zhou, H., Huang, L., Wu, S., Xia, L., Huang, C., et al.: VideoAgent: All-in-one agentic framework for video understanding and editing
- [54] Zhu, W., Huang, Y., Xie, X., Liu, W., Deng, J., Zhang, D., Wang, Z., Liu, J.: AutoShot: A short video dataset and state-of-the-art shot boundary detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2238–2247 (2023)