pith. machine review for the scientific record.

arxiv: 2604.23789 · v2 · submitted 2026-04-26 · 💻 cs.CV

Recognition: no theorem link

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords MuSS dataset · multi-shot video generation · subject-to-video · cinematic narrative benchmark · copy-paste dilemma · progressive captioning · cross-shot matching · ACP-Var metric

The pith

A new dataset built from over 3,000 movies uses progressive captions and cross-shot matching to let AI models generate coherent multi-shot videos without copy-paste shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds MuSS, a large dataset drawn from thousands of films, specifically to train models on multi-shot subject-to-video tasks that require real narrative flow. Existing approaches fail because models either lose story logic across shots or simply paste the subject image without proper motion or 3D consistency. To fix this, the authors create a captioning process that first locks in accurate local descriptions per shot and then aligns them globally, paired with a matching step that blocks trivial copying. They also release a benchmark and ACP-Var metric that test whether generated sequences actually tell a continuous story while preserving subject identity in three dimensions. Experiments show models trained with MuSS outperform baselines on narrative quality and identity preservation.

Core claim

MuSS is constructed via a progressive captioning pipeline that secures shot-level accuracy first and then global narrative coherence, together with a cross-shot matching mechanism that removes the copy-paste shortcut; when used to augment training, this produces models that reach state-of-the-art performance on continuous storytelling and cross-shot subject identity, as measured by the new Cinematic Narrative Benchmark and its Anti-Copy-Paste Variance metric.

What carries the argument

The progressive captioning pipeline plus cross-shot matching mechanism: the pipeline builds accurate local captions before enforcing story-wide consistency, while matching prevents models from simply reusing the input subject image across shots.
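To make the two-stage idea concrete, here is a minimal sketch of how such a pipeline could be organized, assuming caller-supplied captioning and reconciliation models; the names `caption_shot` and `reconcile_captions` are illustrative placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    shot_id: int
    frames: list                  # decoded frames for this shot
    local_caption: str = ""       # stage-1 output: shot-level accuracy
    coherent_caption: str = ""    # stage-2 output: globally consistent


def progressive_caption(
    shots: List[Shot],
    caption_shot: Callable[[list], str],
    reconcile_captions: Callable[[List[str]], List[str]],
) -> List[Shot]:
    """Two-stage captioning sketch: lock in per-shot accuracy first, then
    rewrite the captions jointly so the multi-shot story stays consistent."""
    # Stage 1: local captions, one shot at a time (accuracy before coherence).
    for shot in shots:
        shot.local_caption = caption_shot(shot.frames)

    # Stage 2: a global pass rewrites all captions together so entities,
    # tense, and story order do not conflict across shots.
    coherent = reconcile_captions([s.local_caption for s in shots])
    for shot, caption in zip(shots, coherent):
        shot.coherent_caption = caption
    return shots
```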

If this is right

  • Video foundation models trained on MuSS produce multi-shot sequences with measurably better narrative continuity than single-shot or un-augmented baselines.
  • The ACP-Var metric provides a quantitative way to detect when a generator has collapsed into trivial 2D sticker behavior instead of 3D-consistent storytelling (an illustrative sketch of such a check follows this list).
  • Baselines without MuSS either lose subject identity across shots or fail to maintain story logic, confirming the three core dataset challenges listed in the paper.
  • The dual-track design of MuSS supports both complex montage transitions and subject-centric narratives, allowing the same data to serve multiple generation modes.
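The review does not spell out the ACP-Var formula, so the following is an editorial proxy only: per-shot subject embeddings from an off-the-shelf encoder (e.g. DINOv2, or ArcFace for faces), identity measured against the embedding centroid, and "sticker" behavior flagged when cross-shot appearance variance collapses. Thresholds are illustrative assumptions.

```python
import numpy as np


def acp_var_proxy(subject_embeddings, identity_threshold=0.85, variance_floor=1e-3):
    """Editorial proxy for a copy-paste / 2D-sticker check (not the paper's ACP-Var).

    subject_embeddings: (num_shots, dim) array holding one appearance embedding
    of the target subject per generated shot. Identity should stay high across
    shots, while appearance variance should not collapse to zero -- near-zero
    variance suggests the subject is pasted as a static 2D sticker rather than
    re-rendered with 3D consistency.
    """
    emb = np.asarray(subject_embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)

    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    identity = float((emb @ centroid).mean())   # mean cosine similarity to the centroid
    variance = float(emb.var(axis=0).sum())     # cross-shot appearance spread

    return {
        "identity": identity,
        "cross_shot_variance": variance,
        "looks_like_copy_paste": identity >= identity_threshold and variance < variance_floor,
    }
```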

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the matching mechanism generalizes, similar conflict-removal pipelines could be applied to other video tasks such as text-to-video or image-to-video with long sequences.
  • The benchmark's focus on visual-logic-driven evaluation suggests future work could add metrics for dialogue consistency or emotional arc across shots.
  • Extending MuSS with more diverse film sources or synthetic augmentations might further reduce any remaining domain biases from the original 3000-movie corpus.

Load-bearing premise

The progressive captioning pipeline and cross-shot matching mechanism actually remove spatiotemporal conflicts and the copy-paste shortcut without introducing new biases or artifacts into the dataset.
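The matching mechanism itself is not detailed in the review; a plausible minimal sketch, assuming precomputed subject-crop embeddings per shot, is to keep reference/target shot pairs only when the subject is recognizably the same identity but not a near-duplicate view, so pasting the reference pixels cannot reproduce the target. Both thresholds below are illustrative assumptions, not values from the paper.

```python
import numpy as np


def select_training_pairs(crop_embeddings, same_identity_min=0.6, near_duplicate_max=0.95):
    """Illustrative cross-shot matching filter.

    crop_embeddings: (num_shots, dim) array of subject-crop embeddings for one
    scene. A (reference, target) shot pair is kept only when similarity is high
    enough to be the same subject yet low enough that copying the reference
    crop could not reproduce the target view.
    """
    emb = np.asarray(crop_embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    pairs = []
    for ref in range(sim.shape[0]):
        for tgt in range(sim.shape[0]):
            if ref != tgt and same_identity_min <= sim[ref, tgt] <= near_duplicate_max:
                pairs.append((ref, tgt))
    return pairs
```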

What would settle it

Train a model on MuSS and test it on the Cinematic Narrative Benchmark; if the generated videos still show high copy-paste rates, broken narrative logic, or low ACP-Var scores that match non-MuSS baselines, the central claim fails.
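Stated as a decision rule, with hypothetical score keys and the assumption that higher is better for every metric (the review does not fix the direction of ACP-Var), the check might look like:

```python
def central_claim_holds(muss: dict, baselines: list) -> bool:
    """Decision sketch for the falsification test above. Score keys
    ('narrative', 'identity', 'acp_var') are hypothetical, and higher is
    assumed to be better for every metric."""
    for b in baselines:
        if muss["narrative"] <= b["narrative"]:
            return False  # narrative continuity no better than the baseline
        if muss["identity"] <= b["identity"]:
            return False  # cross-shot subject identity not better preserved
        if muss["acp_var"] <= b["acp_var"]:
            return False  # ACP-Var reading indistinguishable from copy-paste baselines
    return True
```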

Figures

Figures reproduced from arXiv: 2604.23789 by Bingyan Liu, Di Wu, Haojie Zhang, Linjie Zhong, Nanqing Liu, Xingsong Ye, Yaling Liang, Yuancheng Wei.

Figure 1: Overview of the MuSS dataset construction. (Top) Complex Cinematic Narrative: Progressive captioning resolves …
Figure 2: Overview of the MuSS dataset statistics. (a) Video clip duration distribution. (b) Caption length distribution. (c) Caption …
Figure 3: Illustration of the MuSS dataset curation methodology. (a) Multi-Shot Video and Coherent Captioning: Transforms …
Figure 4: Overview of the Cinematic Narrative Benchmark. The evaluation suite employs a novel Visual-Logic Driven paradigm, …
Figure 5: Qualitative results on the Cinematic Narrative Benchmark. (Left) Track 1: Evaluating multi-shot consistency across …
Figure 6: Data examples for Track 1 (Complex Cinematic Narratives). This visualization showcases the curated keyframes …
Figure 7: Data examples for Track 2 (Subject-Centric Narratives). This figure details the data structure by presenting the …
Figure 8: Raw cinematic transitions for Track 1 (Complex Cinematic Narratives). These unannotated screenshot sequences …
Figure 9: Raw cinematic sequences for Track 2 (Subject-Centric Narratives). These extensive multi-shot screenshots demonstrate …
Original abstract

While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MuSS, a large-scale dual-track dataset sourced from over 3,000 movies for multi-shot subject-to-video (S2V) generation. It addresses narrative logic, spatiotemporal alignment conflicts, and the copy-paste shortcut via a progressive captioning pipeline (local shot accuracy followed by global coherence) and a cross-shot matching mechanism. The work also proposes the Cinematic Narrative Benchmark with a visual-logic paradigm and the novel Anti-Copy-Paste Variance (ACP-Var) metric. Experiments claim that MuSS-augmented models achieve SOTA narrative effectiveness and cross-shot identity preservation compared to baselines that struggle with continuous logic or degenerate to 2D stickers.

Significance. If the dataset construction pipeline demonstrably resolves the three core challenges without introducing new biases or artifacts, MuSS and its benchmark would provide a valuable, large-scale resource for advancing cinematic multi-shot video generation. The movie-sourced scale and explicit support for montage transitions represent a concrete empirical contribution that could enable more rigorous evaluation of subject consistency and storytelling in video models.

major comments (3)
  1. [§3.2] Progressive Captioning Pipeline: The two-stage process is described at a high level, but the manuscript provides no quantitative before/after statistics on spatiotemporal conflict rates, no ablation on caption accuracy, and no human or automated verification scores confirming that local-to-global coherence eliminates conflicts rather than trading one set of inconsistencies for another. This directly underpins the claim that the dataset solves the alignment problem and supports the later ACP-Var results.
  2. [§3.3] Cross-Shot Matching Mechanism: No analysis or ablation is reported on whether the matching step introduces systematic biases in subject pose, lighting, camera angle, or motion statistics. Without such checks, it is unclear whether the mechanism truly eradicates the copy-paste shortcut or merely masks it in ways that the ACP-Var metric (defined in §4) may not detect, weakening the attribution of SOTA gains to the dataset.
  3. [§5] Experiments and Benchmark Results: The SOTA claims for narrative effectiveness and identity preservation rest on comparisons with baselines, yet the text lacks details on baseline re-implementations, statistical significance tests, or controls for dataset size effects. If the reported improvements are driven by unverified pipeline artifacts, the central empirical conclusion does not hold (a minimal sketch of a suitable paired significance test follows the minor comments).
minor comments (3)
  1. [Figure 3] The dataset examples would benefit from explicit annotations highlighting the montage transitions and cross-shot subject consistency that the benchmark is designed to test.
  2. [§4.2] The definition of ACP-Var uses variance over 3D structural features; clarify whether these features are extracted from ground-truth 3D reconstructions or estimated via off-the-shelf models, as this affects reproducibility.
  3. [Related Work] A few citations to prior multi-shot video datasets (e.g., in the related work section) appear incomplete; ensure all referenced works have full bibliographic details.
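As an editorial suggestion for the missing significance testing raised in major comment 3, not something the paper reports: a one-sided paired bootstrap over per-prompt benchmark scores would be a cheap way to check that the MuSS-augmented gains are not noise.

```python
import numpy as np


def paired_bootstrap_pvalue(scores_muss, scores_baseline, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap on per-prompt benchmark scores.

    Returns the fraction of resampled mean differences that are <= 0, i.e. the
    estimated probability that the MuSS-augmented model's average advantage
    over the baseline would vanish under resampling of the test prompts.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_muss, dtype=float) - np.asarray(scores_baseline, dtype=float)
    resampled = rng.choice(diffs, size=(n_resamples, diffs.size), replace=True).mean(axis=1)
    return float((resampled <= 0.0).mean())
```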

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to specific revisions that will strengthen the empirical support for our claims without misrepresenting the current work.

Point-by-point responses
  1. Referee: [§3.2] Progressive Captioning Pipeline: The two-stage process is described at a high level, but the manuscript provides no quantitative before/after statistics on spatiotemporal conflict rates, no ablation on caption accuracy, and no human or automated verification scores confirming that local-to-global coherence eliminates conflicts rather than trading one set of inconsistencies for another. This directly underpins the claim that the dataset solves the alignment problem and supports the later ACP-Var results.

    Authors: We agree that the current description of the progressive captioning pipeline would benefit from quantitative validation. In the revised manuscript, we will add before/after statistics on spatiotemporal conflict rates, an ablation study on caption accuracy, and both human and automated verification scores. These additions will demonstrate that the local-to-global coherence step resolves conflicts without introducing new inconsistencies, thereby providing stronger support for the dataset's role in addressing alignment issues and the subsequent ACP-Var results. revision: yes

  2. Referee: [§3.3] Cross-Shot Matching Mechanism: No analysis or ablation is reported on whether the matching step introduces systematic biases in subject pose, lighting, camera angle, or motion statistics. Without such checks, it is unclear whether the mechanism truly eradicates the copy-paste shortcut or merely masks it in ways that the ACP-Var metric (defined in §4) may not detect, weakening the attribution of SOTA gains to the dataset.

    Authors: We acknowledge that additional analysis of the cross-shot matching mechanism is warranted to rule out systematic biases. In the revision, we will include ablations examining effects on subject pose, lighting, camera angle, and motion statistics. We will also expand the discussion of the ACP-Var metric to show how it is designed to detect residual copy-paste artifacts, providing evidence that the mechanism eradicates rather than masks the shortcut. This will strengthen the link between the dataset construction and the reported SOTA gains. revision: yes

  3. Referee: [§5] Experiments and Benchmark Results: The SOTA claims for narrative effectiveness and identity preservation rest on comparisons with baselines, yet the text lacks details on baseline re-implementations, statistical significance tests, or controls for dataset size effects. If the reported improvements are driven by unverified pipeline artifacts, the central empirical conclusion does not hold.

    Authors: We agree that greater transparency and rigor in the experimental section are needed to substantiate the SOTA claims. In the revised manuscript, we will provide detailed descriptions of baseline re-implementations, report statistical significance tests, and include controls for dataset size effects by evaluating models trained on MuSS subsets of varying scales. These additions will confirm that the observed improvements in narrative effectiveness and identity preservation are robust and not attributable to unverified artifacts. revision: yes
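The dataset-size control promised in response 3 could be as simple as the nested-subset loop sketched below; `train_and_eval` is a hypothetical caller-supplied routine that trains a model on the given subset and returns one benchmark score, so gains can be attributed to data scale rather than to a single curation artifact.

```python
from typing import Callable, Sequence


def subset_scaling_curve(dataset: Sequence, fractions: Sequence[float],
                         train_and_eval: Callable[[Sequence], float]) -> dict:
    """Train on nested subsets of the dataset and record the benchmark score
    at each size. `train_and_eval` is assumed to train on the subset and
    return its benchmark score (hypothetical interface)."""
    scores = {}
    for frac in sorted(fractions):
        subset = dataset[: max(1, int(len(dataset) * frac))]
        scores[frac] = train_and_eval(subset)
    return scores
```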

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with independent experimental validation

full rationale

The paper presents MuSS as a dataset sourced from 3,000+ movies, built with a progressive captioning pipeline and cross-shot matching mechanism, plus a new benchmark and ACP-Var metric. No mathematical derivations, equations, or predictions are described that reduce by construction to fitted inputs or self-citations. Claims of SOTA performance rest on external experiments rather than self-referential definitions, making the contribution self-contained as an empirical release without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that movie footage provides authentic examples of narrative logic and that the custom pipeline successfully resolves the stated challenges.

axioms (1)
  • domain assumption: Movie clips from over 3,000 films provide representative examples of complex multi-shot narrative logic and spatiotemporal alignments suitable for training subject-to-video models.
    The dataset construction begins from this premise to address the three core challenges listed in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1212 out tokens · 50482 ms · 2026-05-12T00:45:48.751434+00:00 · methodology

discussion (0)

