pith. machine review for the scientific record.

arxiv: 2604.10456 · v1 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords cinematic video compilation · multi-agent system · instruction-driven editing · video benchmark · narrative coherence · script planning · short video generation

The pith

CineAgents reformulates cinematic video compilation as a design-and-compose process to achieve superior narrative and logical coherence from user instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CineBench as the first benchmark for instruction-driven cinematic video compilation, pairing diverse instructions with professional-editor annotations as ground truth. It proposes CineAgents, a multi-agent system that constructs a hierarchical narrative memory through script reverse-engineering and refines a creative blueprint via iterative narrative planning to address contextual collapse and temporal fragmentation. A sympathetic reader would care because demand for turning long-form cinematic content into short videos is surging, yet existing methods remain restricted to predefined tasks and the field lacks versatile evaluation standards.

Core claim

CineAgents significantly outperforms existing methods, generating compilations with superior narrative and logical coherence. It achieves this by reformulating the task into a design-and-compose paradigm: script reverse-engineering builds a hierarchical narrative memory that supplies multi-level context, and an iterative narrative planning process refines a creative blueprint into the final compiled script.

What carries the argument

The design-and-compose paradigm in the CineAgents multi-agent system, which uses script reverse-engineering for hierarchical narrative memory and iterative planning to produce the compiled script.
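
To make the paradigm concrete, here is a minimal sketch of how a design-and-compose pipeline could be organized. The class and function names, the naive scene grouping, and the heuristic planning step are all hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of a "design-and-compose" pipeline. Agent names, data
# structures, and heuristics are hypothetical stand-ins; the paper's
# CineAgents implementation may differ substantially.
from dataclasses import dataclass, field


@dataclass
class Shot:
    clip_id: str
    start: float      # seconds into the source video
    end: float
    caption: str      # machine description of the shot


@dataclass
class NarrativeMemory:
    # Hierarchical context: shot-level captions, scene-level summaries,
    # and a global synopsis, mirroring the "multi-level context" idea.
    shots: list[Shot] = field(default_factory=list)
    scenes: dict[str, str] = field(default_factory=dict)
    synopsis: str = ""


def script_reverse_engineering(shots: list[Shot]) -> NarrativeMemory:
    """Build a hierarchical narrative memory from per-shot descriptions.
    Scenes are grouped naively by clip_id here; a real system would use
    scene segmentation plus an LLM summarizer."""
    memory = NarrativeMemory(shots=shots)
    for shot in shots:
        memory.scenes[shot.clip_id] = memory.scenes.get(shot.clip_id, "") + shot.caption + " "
    memory.synopsis = " ".join(memory.scenes.values())[:500]
    return memory


def plan_blueprint(instruction: str, memory: NarrativeMemory) -> list[str]:
    """Draft a creative blueprint: an ordered list of narrative beats the
    compilation should cover. Stand-in for an LLM planning call."""
    return [f"{instruction}: beat drawn from scene {scene_id}"
            for scene_id in memory.scenes]


def iterative_refinement(blueprint: list[str],
                         memory: NarrativeMemory,
                         rounds: int = 3) -> list[Shot]:
    """Refine the blueprint into a compiled script (an ordered shot list),
    keeping only shots that serve some beat. Stand-in for agent critique loops."""
    compiled = list(memory.shots)
    for _ in range(rounds):
        compiled = [s for s in compiled
                    if any(s.clip_id in beat for beat in blueprint)]
    return compiled


def design_and_compose(instruction: str, shots: list[Shot]) -> list[Shot]:
    memory = script_reverse_engineering(shots)          # design: understand the source
    blueprint = plan_blueprint(instruction, memory)     # design: plan the narrative
    return iterative_refinement(blueprint, memory)      # compose: refine into a script
```

A retrieve-and-rank baseline would collapse the planning and refinement stages into a single similarity ranking over shots, which is roughly where the paper locates contextual collapse and temporal fragmentation.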

If this is right

  • Cinematic video compilation can move beyond predefined tasks to handle a wide range of user instructions with improved coherence.
  • Future systems can be evaluated against a shared benchmark with professional ground-truth compilations.
  • Automatic adaptation of long-form content yields short videos with stronger narrative flow and logical structure.
  • Multi-level context from hierarchical memory reduces fragmentation in compiled results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The hierarchical memory construction might transfer to automated summarization of other time-based media such as lectures or sports broadcasts.
  • Iterative planning in a multi-agent setup could support hybrid human-AI video editing pipelines where editors refine the blueprint.
  • If the benchmark expands, it might reveal whether similar coherence gains appear in non-cinematic domains like documentary or tutorial compilation.

Load-bearing premise

Annotations by professional editors in CineBench provide an objective and comprehensive ground truth for evaluating cinematic compilation quality across diverse instructions.

What would settle it

Independent human raters scoring CineAgents outputs as equal or inferior in narrative and logical coherence to existing methods on the same CineBench instructions would falsify the superiority claim.
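
One concrete form such a test could take is a head-to-head preference study: for each sampled instruction, an independent rater picks the more coherent compilation, and a two-sided sign test checks whether the preference rate departs from chance. The sketch below is illustrative only; the win counts are invented and the protocol is not the paper's.

```python
# Head-to-head preference test: for each CineBench instruction, an independent
# rater marks which system's compilation is more coherent. A two-sided sign
# test on the win count checks whether the preference differs from chance.
# The counts below are illustrative, not results from the paper.
from scipy.stats import binomtest

cineagents_wins = 34
baseline_wins = 16   # ties excluded for the sign test

result = binomtest(cineagents_wins, cineagents_wins + baseline_wins, p=0.5)
print(f"win rate = {cineagents_wins / (cineagents_wins + baseline_wins):.2f}, "
      f"p = {result.pvalue:.4f}")
```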

Figures

Figures reproduced from arXiv: 2604.10456 by Boxin Shi, Chang Zhou, Chunjie Zhang, Hualuo Liu, Jingqi Liu, Peixuan Zhang, Shuchen Weng, Si Li, Xiaohui Zhou, Xi Chen, Ziyuan Zhang.

Figure 1. Comparison of video compilation paradigms. (a) Existing video compilation methods typically rely on a "retrieve-and-rank" paradigm; applied to complex cinematic videos, this isolated approach leads to contextual collapse and temporal fragmentation and produces narratively incoherent results. (b) In contrast, the proposed CineAgents introduces a "design-and-compose" paradigm.
Figure 2. Overview and statistics of CineBench. (a) Examples of instructions and compiled video clips. (b) Distribution of video release years, types, and instruction lengths. (c) Word cloud of instruction keyword frequencies.
Figure 3. Overview of the proposed CineAgents. Given a user instruction and a set of source videos, the system generates a final compiled video through: (a) script reverse-engineering, where the script agent integrates multimodal information from the source videos to construct a hierarchical narrative memory that provides multi-level context to overcome contextual collapse (Sec. 4.2); and (b) cinematic sequence production.
Figure 4. Qualitative comparison with state-of-the-art methods. Segments in blue are correct in both content and temporal position, aligning with the human reference (GT); green segments are semantically correct but temporally misplaced; orange segments are semantically irrelevant to the instruction; and dashed regions mark areas where the model fails to retrieve the corresponding content.
Figure 5. User study results on Live-Action and Animated videos. In the Overall Quality Evaluation (OQE), participants select the best video, considering all factors including content relevance, narrative flow, and overall viewing experience. For each study, 50 samples were randomly selected from CineBench and 25 volunteers provided independent evaluations.
Figure 6. Visual example of CineAgents compilation performance on animated content, highlighting the model's generalization capability beyond live-action video.
Original abstract

The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into a "design-and-compose" paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory to provide multi-level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative coherence and logical coherence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce CineBench, the first benchmark for instruction-driven cinematic video compilation featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. It also presents CineAgents, a multi-agent system reformulating the task into a 'design-and-compose' paradigm that performs script reverse-engineering to build a hierarchical narrative memory and uses iterative narrative planning to refine a creative blueprint, claiming through extensive experiments that it significantly outperforms existing methods in narrative coherence and logical coherence.

Significance. If the empirical claims hold, this work supplies a needed benchmark resource and a structured multi-agent method for handling narrative and temporal issues in video compilation, which could support future research on adapting long-form cinematic content. The creation of a professionally annotated benchmark and the explicit 'design-and-compose' reformulation are concrete contributions that enable falsifiable comparisons.

major comments (2)
  1. [CineBench] CineBench section: the superiority claims rest on professional-editor annotations serving as objective ground truth for narrative and logical coherence across diverse instructions, yet no inter-annotator agreement statistics, coverage analysis of instruction types (e.g., non-linear narratives or cross-shot constraints), or sensitivity tests are reported; this is load-bearing for validating that higher scores reflect genuine coherence gains rather than style imitation.
  2. [Experiments] Experiments section: the abstract and results assert that CineAgents 'significantly outperforms existing methods' with 'superior narrative coherence and logical coherence,' but no concrete metrics, baseline implementations, coherence scoring protocol, or statistical tests are detailed in the provided description, preventing verification of the central empirical claim.
minor comments (2)
  1. [Introduction] The 'design-and-compose' paradigm is referenced repeatedly without a concise formal definition or explicit contrast to prior multi-agent decomposition strategies.
  2. [Figure 2] Figure captions for the system architecture would be clearer if they explicitly labeled the hierarchical memory construction and iterative refinement loops.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will incorporate revisions to strengthen the validation of CineBench and the transparency of our experimental reporting.

Point-by-point responses
  1. Referee: [CineBench] CineBench section: the superiority claims rest on professional-editor annotations serving as objective ground truth for narrative and logical coherence across diverse instructions, yet no inter-annotator agreement statistics, coverage analysis of instruction types (e.g., non-linear narratives or cross-shot constraints), or sensitivity tests are reported; this is load-bearing for validating that higher scores reflect genuine coherence gains rather than style imitation.

    Authors: We agree that inter-annotator agreement statistics are essential to substantiate the reliability of the professional-editor annotations as ground truth. In the revised manuscript, we will report inter-annotator agreement metrics (e.g., Fleiss' kappa) computed across the annotators for both narrative and logical coherence scores. We will also add a coverage analysis subsection detailing the distribution of instruction types in CineBench, explicitly quantifying the inclusion of non-linear narratives and cross-shot constraints. Sensitivity tests will be included via an analysis of how minor variations in annotations impact benchmark outcomes. These elements will be added to the CineBench section to directly address the concern. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract and results assert that CineAgents 'significantly outperforms existing methods' with 'superior narrative coherence and logical coherence,' but no concrete metrics, baseline implementations, coherence scoring protocol, or statistical tests are detailed in the provided description, preventing verification of the central empirical claim.

    Authors: The full experiments section provides concrete metrics (narrative coherence and logical coherence scores), descriptions of baseline implementations, and the coherence scoring protocol used by human evaluators. However, to enhance verifiability as noted, we will expand the section with a more explicit step-by-step description of the scoring protocol, additional implementation details for all baselines, and statistical significance tests (including p-values from appropriate tests such as paired t-tests). These clarifications and expansions will be integrated into the Experiments section and associated tables. revision: yes
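
Response 1 above promises inter-annotator agreement via Fleiss' kappa. As a rough illustration of what that computation involves, here is a minimal, self-contained sketch over categorical coherence ratings; the rating matrix, scale, and rater counts are invented for the example and are not the paper's data.

```python
# Minimal Fleiss' kappa over categorical ratings. ratings[i][j] is the number
# of annotators who assigned item i to category j; every row must sum to the
# same number of raters. Purely illustrative; the paper's agreement protocol
# is not specified in the material above.
def fleiss_kappa(ratings: list[list[int]]) -> float:
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Per-item agreement P_i.
    p_items = []
    for row in ratings:
        assert sum(row) == n_raters, "each item needs the same rater count"
        p_items.append((sum(c * c for c in row) - n_raters)
                       / (n_raters * (n_raters - 1)))
    p_bar = sum(p_items) / n_items

    # Chance agreement P_e from the marginal category proportions.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_categories)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Example: 4 items, 3 annotators, 3-point coherence scale (low/mid/high).
print(fleiss_kappa([[0, 1, 2], [3, 0, 0], [1, 1, 1], [0, 0, 3]]))
```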
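
Response 2 promises paired significance tests on per-sample coherence scores. Below is a minimal sketch of that comparison, assuming both systems are scored on the same CineBench instructions; the score arrays are synthetic placeholders, and scipy's ttest_rel and wilcoxon are standard library calls, not the paper's code.

```python
# Paired significance test on per-instruction coherence scores for two systems
# evaluated on the same CineBench samples. Scores are synthetic placeholders.
from scipy import stats

cineagents_scores = [4.2, 3.8, 4.5, 4.0, 3.9, 4.4, 4.1, 3.7]
baseline_scores   = [3.6, 3.5, 4.1, 3.8, 3.4, 3.9, 3.6, 3.5]

# Paired t-test, as proposed in the rebuttal.
t_stat, p_value = stats.ttest_rel(cineagents_scores, baseline_scores)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")

# Non-parametric alternative if the score distributions look non-normal.
w_stat, p_wilcoxon = stats.wilcoxon(cineagents_scores, baseline_scores)
print(f"Wilcoxon W = {w_stat:.3f}, p = {p_wilcoxon:.4f}")
```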

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical evaluation on new benchmark

Full rationale

The paper introduces CineBench as a new benchmark with professional-editor annotations and presents CineAgents as a multi-agent system using standard techniques (script reverse-engineering, hierarchical memory, iterative planning). The central claim of outperformance is an empirical comparison on this benchmark, with no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its inputs by construction. No self-definitional steps, uniqueness theorems from prior author work, or ansatzes smuggled via citation are present in the provided text. The derivation is self-contained as a standard benchmark-plus-method construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, mathematical axioms, or invented physical entities are described. The work relies on standard multi-agent LLM frameworks and professional annotations without introducing new conserved quantities or ungrounded entities.

pith-pipeline@v0.9.0 · 5476 in / 1177 out tokens · 28175 ms · 2026-05-10T15:24:21.411501+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 5 internal anchors
