pith. sign in

arxiv: 2606.21661 · v1 · pith:KKJKI3VSnew · submitted 2026-06-19 · 💻 cs.CV

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Pith reviewed 2026-06-26 14:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-shot video generationmemory-driven video synthesisboundary-aware gatingcross-shot coherenceaudio-video generationcinematic shot transitionsspeaker identity preservation
0
0 comments X

The pith

UnityShots generates coherent multi-shot videos by updating two fixed-size memory slots at each cut with a boundary-aware gate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a memory-driven system for producing audio-video sequences that stay consistent across multiple shots. It keeps subject appearance, scene context, and speaker identity intact without growing memory banks or restricting to fixed sequence lengths. The approach relies on a long-term slot anchored to the first shot and a short-term slot for the prior shot, both refreshed at cuts by a gate that combines visual cut detection with beat signals from audio. A reference speaker token handles vocal continuity separately. Tests across image-to-video, text-to-video, and reference-to-video modes show stronger cross-shot metrics than open-source alternatives.

Core claim

UnityShots maintains two fixed-size memory slots on the LTX-2.3 backbone, a long-term memory slot anchored to the opening shot and a short-term memory slot holding the preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals, while injecting a reference speaker token for audio and learning a discrete cut-type prior via AdaLN for transition control.

What carries the argument

Two fixed-size memory slots (long-term anchored to opening shot, short-term to preceding tail) updated by a boundary-conditioned gate fusing visual cut probability and beat-tracker signals.

If this is right

  • Multi-shot generation scales without linear memory growth or fixed-length training constraints.
  • Speaker identity holds across shots through reference tokens without maintaining an audio memory bank.
  • A learned cut-type prior provides direct control over transition strength during inference.
  • A new benchmark of 200 multi-cultural sequences with per-shot references and boundary labels supports standardized evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-slot design may support extension to longer-form content if most cinematic cuts rely on short-range context.
  • The same gate structure could be tested on non-cinematic domains such as instructional or documentary video where continuity rules differ.
  • Replacing the beat-tracker input with other temporal signals might adapt the method to silent or music-free generation tasks.

Load-bearing premise

The boundary-conditioned gate can preserve subject appearance, scene context, and speaker identity across cuts using only two fixed-size memory slots updated at every cut.

What would settle it

A controlled comparison measuring cross-shot coherence metrics on sequences with five or more shots when the boundary gate is replaced by uniform random updates or when one memory slot is removed.

Figures

Figures reproduced from arXiv: 2606.21661 by Bin Xia, Jiahao Wang, Jiaya Jia, Jiehui Huang, Meng Chu, Pengfei Wan, Xin Tao, Xu He, Yuechen Zhang, Zhenchao Tang.

Figure 1
Figure 1. Figure 1: UnityShots generates consistent multi-shot audio-video sequences across diverse cultural contexts. Each row shows shots from the proposed multi-cultural benchmark; the reference identity portrait (leftmost) is preserved through hard cuts by the dual-slot memory bank and reference audio anchor. Upper left: UnityShots pairs a long-term (LTM) slot anchored to the opening shot with a short￾term (STM) slot capt… view at source ↗
Figure 2
Figure 2. Figure 2: UnityShots architecture. Left: Strata-RoPE partitions the temporal band of the video stream into three non-overlapping strata for the long-term memory (LTM) slot, the short-term memory (STM) slot, and the current shot. Right: at each shot boundary the composite signal bN produces two gating coefficients gltm and gstm that update the two video memory slots before the dual-stream DiT denoises the current sho… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on the multi-cultural benchmark. Columns show LTX-2 (no memory), ID-LoRA, Kling, and Uni￾tyShots(I2V) on 3-shot and 5-shot sequences from diverse cultural contexts. UnityShots follows per-shot prompts while maintaining stable subject identity across all cuts. LTX-2 and ID-LoRA accumulate visible drift from shot 2 onward, and Kling, despite high per-frame fidelity, shows subject-frami… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation visualization on representative multi-cultural [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison in T2V mode. Each column shows a different method generating a 4-shot sequence from text prompts alone. UnityShots(T2V) maintains consistent subject appearance and scene context across all shots while producing synchronized audio. HoloCine achieves competitive per-shot visual quality but lacks a cross-shot memory mechanism and generates no audio track. Open￾source T2V baselines exhib… view at source ↗
Figure 6
Figure 6. Figure 6: Training and benchmark construction pipelines. Top: the 7-stage training pipeline processes raw cinematic and music￾video footage through shot segmentation (TransNetv2), subtitle removal, audio diarisation (Whisper/DEMUCS), multimodal captioning (Qwen3-VL/Qwen3-Omni [2]), character anchoring, record merging, and quality filtering, yielding 146k annotated shot records. Bottom: the benchmark pipeline generat… view at source ↗
Figure 7
Figure 7. Figure 7: Multi-cultural benchmark overview. Representative reference portraits, per-shot first frames, and generated sequences from six ethnic regions covered by the 200-sequence benchmark. Each row shows a distinct cultural context; characters share identity and voice across all shots in a sequence. 0 1 2 3 4 5 6 7 8 Shot index 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 RIC-Subj (running mean, w=3) LTM-anchored opening (… view at source ↗
Figure 8
Figure 8. Figure 8: Long-sequence identity preservation (left) and human-evaluation results (right). See Sections [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
read the original abstract

Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of $200$ multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3. It maintains subject/scene/speaker consistency via two fixed-size memory slots (LTM anchored to the opening shot and STM holding the prior tail) updated at each cut by a boundary-conditioned gate fusing visual cut probability and beat-tracker signals; audio uses a reference speaker token; a discrete cut-type prior is learned via AdaLN for inference-time control. The work also releases a benchmark of 200 multi-cultural multi-shot sequences with per-shot references and transition labels, claiming that UnityShots outperforms open-source baselines on all cross-shot coherence metrics and matches the strongest closed-source systems across I2V/T2V/R2V conditioning modes.

Significance. If the two-slot memory architecture demonstrably preserves coherence without degradation over long sequences, the approach would offer a scalable alternative to linear memory banks or fixed-length end-to-end training. The release of the 200-sequence benchmark with per-boundary labels and multi-cultural coverage is a concrete positive contribution that could support future work on multi-shot consistency.

major comments (1)
  1. [Architecture description] Architecture description: the central claim that subject appearance, scene context, and speaker identity are maintained across cuts (including sequences of 10 or more shots) rests on the two fixed-size slots (LTM anchored to shot 1 + STM = prior tail) updated by the boundary-conditioned gate. For sequences exceeding ~5 shots, intermediate context from shots 3–8 must be discarded or compressed into the opening-shot LTM; no ablation is reported showing that coherence metrics remain stable once the number of cuts exceeds the two-slot capacity. This directly underpins the cross-shot coherence claims.
minor comments (1)
  1. [Abstract] Abstract: quantitative results, ablation tables, and error analysis are absent, making it impossible to assess the magnitude of the reported gains over baselines.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of validating long-sequence coherence under the two-slot memory design. We address the concern directly below and commit to strengthening the empirical support in revision.

read point-by-point responses
  1. Referee: [Architecture description] Architecture description: the central claim that subject appearance, scene context, and speaker identity are maintained across cuts (including sequences of 10 or more shots) rests on the two fixed-size slots (LTM anchored to shot 1 + STM = prior tail) updated by the boundary-conditioned gate. For sequences exceeding ~5 shots, intermediate context from shots 3–8 must be discarded or compressed into the opening-shot LTM; no ablation is reported showing that coherence metrics remain stable once the number of cuts exceeds the two-slot capacity. This directly underpins the cross-shot coherence claims.

    Authors: We agree that an explicit ablation on sequence length is necessary to substantiate the claim that coherence holds for 10+ shots. The two-slot architecture is designed so that the LTM retains an anchor representation of the initial subject/scene/speaker while the boundary-conditioned gate selectively compresses and updates information into the STM at each cut; however, we did not report a controlled study varying shot count in the original submission. In the revised manuscript we will add an ablation evaluating all cross-shot coherence metrics on held-out sequences of 2, 5, 10, and 15 shots (using the released benchmark), with results demonstrating that performance remains stable beyond the nominal two-slot capacity. We will also include qualitative examples of 12-shot sequences to illustrate retention of identity and context. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture description only

full rationale

The paper presents an engineering system (two fixed-size LTM/STM slots, boundary-conditioned gate fusing cut probability and beat signals, AdaLN cut-type prior, reference speaker token) without any derivation chain, equations, fitted predictions, or first-principles results. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core claims. The abstract and description are self-contained architectural choices evaluated on an external benchmark; nothing reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies insufficient technical detail to identify or audit free parameters, axioms, or invented entities beyond the high-level system components named.

pith-pipeline@v0.9.1-grok · 5821 in / 1153 out tokens · 30466 ms · 2026-06-26T14:29:21.270699+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 14 linked inside Pith

  1. [1]

    Onestory: Coherent multi- shot video generation with adaptive memory.arXiv preprint arXiv:2512.07802, 2025

    Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xi- aoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapi- tiya, Ding Liu, Sen He, et al. Onestory: Coherent multi- shot video generation with adaptive memory.arXiv preprint arXiv:2512.07802, 2025. 2

  2. [2]

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  3. [3]

    Whisperx: Time-accurate speech transcription of long- form audio.arXiv preprint arXiv:2303.00747, 2023

    Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisser- man. Whisperx: Time-accurate speech transcription of long- form audio.arXiv preprint arXiv:2303.00747, 2023. 1

  4. [4]

    Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 2

  5. [5]

    Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luh- man, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024. 2

  6. [6]

    Seedance 2.0.https : / / seed

    ByteDance. Seedance 2.0.https : / / seed . bytedance.com/en/seed2, 2026. 2

  7. [7]

    Id-lora: Identity-driven audio-video personalization with in-context lora.arXiv preprint arXiv:2603.10256, 2026

    Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, and Raja Giryes. Id-lora: Identity-driven audio-video personalization with in-context lora.arXiv preprint arXiv:2603.10256, 2026. 2, 6, 3

  8. [8]

    Hybrid spectrogram and waveform source separation.arXiv preprint arXiv:2111.03600, 2021

    Alexandre D ´efossez. Hybrid spectrogram and waveform source separation.arXiv preprint arXiv:2111.03600, 2021. 1

  9. [9]

    Clap learning audio concepts from nat- ural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from nat- ural language supervision. InICASSP 2023-2023 IEEE In- ternational Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pages 1–5. IEEE, 2023. 5

  10. [10]

    Beat this! accurate beat tracking without dbn postprocessing

    Francesco Foscarin, Jan Schl ¨uter, and Gerhard Widmer. Beat this! accurate beat tracking without dbn postprocessing. arXiv preprint arXiv:2407.21658, 2024. 4, 5

  11. [11]

    Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

    Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026. 3, 5, 6, 2

  12. [12]

    Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhi- jie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025. 2

  13. [13]

    Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026. 2, 3, 6

  14. [14]

    Cut2next: Generating next shot via in-context tuning.arXiv preprint arXiv:2508.08244,

    Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Cut2next: Generating next shot via in-context tuning.arXiv preprint arXiv:2508.08244,

  15. [15]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 4

  16. [16]

    Imagen video: High definition video generation with diffusion mod- els.arXiv preprint arXiv:2210.02303, 2022

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els.arXiv preprint arXiv:2210.02303, 2022. 2

  17. [17]

    Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 2

  18. [18]

    In-context LoRA for diffusion transformers

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024. 2

  19. [19]

    Vbench: Comprehensive bench- mark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 5, 2

  20. [20]

    Shotadapter: Text-to- multi-shot video generation with diffusion models

    Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Cey- lan, James M Rehg, and Tobias Hinz. Shotadapter: Text-to- multi-shot video generation with diffusion models. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 28405–28415, 2025. 2

  21. [21]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 2

  22. [22]

    Kling ai.https://klingai

    Kuaishou Technology. Kling ai.https://klingai. com/, 2026. 2, 5, 6

  23. [23]

    Direct: Video mashup cre- ation via hierarchical multi-agent planning and intent-guided editing.arXiv preprint arXiv:2604.04875, 2026

    Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng, Shaoqi Wang, and Xiang Chen. Direct: Video mashup cre- ation via hierarchical multi-agent planning and intent-guided editing.arXiv preprint arXiv:2604.04875, 2026. 2, 3

  24. [24]

    Univa: Universal video agent towards open-source next-generation video generalist.arXiv preprint arXiv:2511.08521, 2025

    Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, et al. Univa: Universal video agent towards open-source next-generation video generalist.arXiv preprint arXiv:2511.08521, 2025. 2, 3

  25. [25]

    Audioldm 2: Learning holistic audio gen- eration with self-supervised pretraining.IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, 32: 2871–2883, 2024

    Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. Audioldm 2: Learning holistic audio gen- eration with self-supervised pretraining.IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, 32: 2871–2883, 2024. 3

  26. [26]

    Ovi: Twin backbone cross-modal fusion for audio-video genera- tion.arXiv preprint arXiv:2510.01284, 2025

    Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video genera- tion.arXiv preprint arXiv:2510.01284, 2025. 6, 2

  27. [27]

    Filmweaver: Weaving consistent multi- shot videos with cache-guided autoregressive diffusion

    Xiangyang Luo, Qingyu Li, Xiaokun Liu, Wenyu Qin, Miao Yang, Meng Wang, Pengfei Wan, Di Zhang, Kun Gai, and Shao-Lun Huang. Filmweaver: Weaving consistent multi- shot videos with cache-guided autoregressive diffusion. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 7689–7697, 2026. 2

  28. [28]

    Shotstream: Streaming multi-shot video generation for inter- active storytelling.arXiv preprint arXiv:2603.25746, 2026

    Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for inter- active storytelling.arXiv preprint arXiv:2603.25746, 2026. 2

  29. [29]

    Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

    Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, et al. Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025. 2, 5, 6

  30. [30]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 5

  31. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  32. [32]

    Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

  33. [33]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international confer- ence for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020. 5

  34. [34]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2

  35. [35]

    Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 5

  36. [36]

    Transnet v2: An ef- fective deep network architecture for fast shot transition detection

    Tom ´as Soucek and Jakub Lokoc. Transnet v2: An ef- fective deep network architecture for fast shot transition detection. corr abs/2008.04838 (2020).arXiv preprint arXiv:2008.04838, 2020. 4, 5, 1

  37. [37]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

  38. [38]

    Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026

    OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026. 6, 2, 3

  39. [39]

    Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound

    Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139, 2025. 5

  40. [40]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 3

  41. [41]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2

  42. [42]

    Cinemaster: A 3d-aware and controllable frame- work for cinematic text-to-video generation

    Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable frame- work for cinematic text-to-video generation. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–10, 2025. 2

  43. [43]

    Multishotmaster: A controllable multi-shot video generation framework.arXiv preprint arXiv:2512.03041,

    Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, and Xu Jia. Multishotmaster: A controllable multi-shot video generation framework.arXiv preprint arXiv:2512.03041,

  44. [44]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. InInter- national Conference on Learning Representations, pages 42055–42079, 2024. 5

  45. [45]

    Qwen-image technical report,

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

  46. [46]

    Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models.arXiv preprint arXiv:2508.11484, 2025

    Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, and Xinyuan Chen. Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models.arXiv preprint arXiv:2508.11484, 2025. 2

  47. [47]

    Captain cin- ema: Towards short movie generation.arXiv preprint arXiv:2507.18634, 2025

    Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan L Yuille, and Lu Jiang. Captain cin- ema: Towards short movie generation.arXiv preprint arXiv:2507.18634, 2025. 2

  48. [48]

    Motioncanvas: Cinematic shot design with controllable image-to-video generation

    Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Anirud- dha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. Motioncanvas: Cinematic shot design with controllable image-to-video generation. InProceedings of the Special Interest Group on Computer Graphics and Interactive Tech- niques Conference Conference Papers, pages 1–11, 2025. 2

  49. [49]

    Cogvideox: Text-to- video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xi- aohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to- video diffusion models with an expert transformer. InIn- ternational Conference on Learning Representations, pages 83048–83077, 2025. 2

  50. [50]

    Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arXiv:2308.06721,

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip- adapter: Text compatible image prompt adapter for text-to- image diffusion models.arXiv preprint arXiv:2308.06721,

  51. [51]

    Storymem: Multi-shot long video storytelling with memory.arXiv preprint arXiv:2512.19539, 2025

    Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. Storymem: Multi-shot long video storytelling with memory.arXiv preprint arXiv:2512.19539, 2025. 2, 4, 5

  52. [52]

    Frame context packing and drift prevention in next-frame-prediction video diffusion models

    Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models. Advances in Neural Information Processing Systems, 38: 30546–30566, 2026. 2

  53. [53]

    Stage: Storyboard-anchored gener- ation for cinematic multi-shot narrative.arXiv preprint arXiv:2512.12372, 2025

    Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, and Boxin Shi. Stage: Storyboard-anchored gener- ation for cinematic multi-shot narrative.arXiv preprint arXiv:2512.12372, 2025. 2

  54. [54]

    Fo- leycrafter: Bring silent videos to life with lifelike and syn- chronized sounds.International Journal of Computer Vision, 134(1):46, 2026

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Bin Liu, and Kai Chen. Fo- leycrafter: Bring silent videos to life with lifelike and syn- chronized sounds.International Journal of Computer Vision, 134(1):46, 2026. 3

  55. [55]

    Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024. 2

  56. [56]

    Videomemory: Toward consis- tent video generation via memory integration.arXiv preprint arXiv:2601.03655, 2026

    Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, and Ying-cong Chen. Videomemory: Toward consis- tent video generation via memory integration.arXiv preprint arXiv:2601.03655, 2026. 2, 3

  57. [57]

    Nipemaembemawili,haraka.\

    Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2024. 2 HoloCine(14B, no Audio) Story:TheShadowWall(Noon-screencaptions,nosignswithEnglishtext)Shot1:Mediumstaticshot.Warmstudiol...

  58. [58]

    and an inferred transition type (CONTINUE,DISSOLVE, HARD) used to populate the cut-type prior in the boundary- conditioned gating module.(2) Subtitle handling.Each shot is annotated with its burnt-in subtitle text, if any. Dur- ing training,20%of samples randomly apply a subtitle- removal model to produce a subtitle-free variant, enabling the model to gen...