Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Divya Kothandaraman; Ming C. Lin; Taewon Kang

arxiv: 2503.06310 · v4 · pith:CBPALY2Qnew · submitted 2025-03-08 · 💻 cs.CV

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Taewon Kang , Divya Kothandaraman , Ming C. Lin This is my paper

Pith reviewed 2026-05-23 00:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-video generationprompt fusiontemporal coherencediffusion modelsvideo storytellingscene-action continuitylatent blending

0 comments

The pith

Scene-action prompt fusion produces temporally consistent long-form videos from discrete text prompts without extra training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that fuses scene and action prompts to generate coherent extended video sequences. It relies on bidirectional time-weighted latent blending to maintain consistency between segments, a dynamics-informed prompt weighting mechanism that adjusts balances at each diffusion step using alignment and smoothness signals, and semantic action representations to guide transitions. If effective, this yields fluid narratives that preserve both visual context and action flow across multiple clips. The approach claims to outperform existing methods on temporal coherence and visual quality in experiments.

Core claim

The method integrates three components—bidirectional time-weighted latent blending for temporal consistency, dynamics-informed prompt weighting that balances scene and action prompts based on CLIP alignment, narrative progression, and temporal smoothness, and semantic action representations that modulate transitions by action similarity—to enable fluid video narratives from discrete text prompts that reflect both scene context and action dynamics.

What carries the argument

Dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep using CLIP-based alignment combined with narrative progression and temporal smoothness signals.

If this is right

Latent-space blending with bidirectional temporal constraints prevents abrupt transitions between video segments.
The method produces long-form videos that maintain semantic consistency and scene-action continuity.
No additional training is required to achieve these results on top of existing diffusion models.
Experiments show significant outperformance over baselines in temporal consistency and visual quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The blending strategy could be tested on other sequential generation tasks like multi-shot image sequences.
If the weighting signals generalize, the approach might reduce reliance on fine-tuning for domain-specific video outputs.
Extending the semantic action representation to handle more complex narrative structures could address longer stories.

Load-bearing premise

The dynamics-informed prompt weighting can reliably balance scene and action prompts at each step to enforce consistency across video segments.

What would settle it

Generated videos that exhibit abrupt scene changes, inconsistent actions, or loss of narrative progression when the prompt weighting is applied would show the balancing mechanism does not hold.

Figures

Figures reproduced from arXiv: 2503.06310 by Divya Kothandaraman, Ming C. Lin, Taewon Kang.

**Figure 1.** Figure 1: Example of story generation using our proposed method Each row represents a video segment V12, V34, V56, where the scene and motion evolve according to the specified textual prompts. The first prompt (P1) describes a detailed action (e.g., ”Tom Cruise is trying to catch Taylor Swift at the Tidal Basin surrounded by cherry blossoms”), with the second prompt (P2) follows (e.g., ”Tom Cruise is running”). Our … view at source ↗

**Figure 2.** Figure 2: Overview of our proposed storytelling framework. Each video segment V2N−1,2N is generated from a pair of prompts (P2N−1, P2N ), ensuring coherent short-form synthesis. These segments are then sequentially combined to form a complete story. We incorporate Time-Weighted Blending (TWB) and Semantic Action Representation (SAR) to maintain temporal coherence and logical action continuity across segments. TWB en… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison across three sequence types: (1) multiple character sequence, (2) sequential action sequence, and (3) living being–object interaction sequence. Our model consistently maintains character positioning, background consistency, motion continuity, and natural interactions across diverse scenarios. For additional video demonstrations, please refer to the supplementary material (see ??). Me… view at source ↗

**Figure 4.** Figure 4: Ablation Study Each row shows a different setting: Full Model, w/o TWB, w/o DIPW, w/o SAR, and a baseline (Mochi). The full model produces the most coherent motion and character interactions. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is an incremental assembly of existing diffusion and CLIP components for video storytelling with no quantitative evidence provided for the performance claims.

read the letter

The main thing to know is that the paper describes a prompt fusion approach for longer coherent text-to-video using latent blending and adaptive weighting, but the abstract supplies no metrics, baselines, or experiment details to support the outperformance assertions. The three components are bidirectional time-weighted latent blending for temporal consistency, a dynamics-informed prompt weighting mechanism that uses CLIP alignment plus narrative and smoothness signals, and semantic action representations to guide transitions. These are applied to the storytelling case without extra training. The combination for scene-action continuity is a reasonable engineering step and addresses a practical pain point in current video models. The paper does a clear job naming the continuity problems across segments. The soft spots are the missing evidence. The abstract states significant gains in temporal consistency and visual quality but shows nothing to back that up—no numbers, no comparisons, no ablations. That makes the central claim unsupported from what is visible. The DIPW weighting is described at a high level as adaptive, but without results it remains an untested assertion. This work is aimed at practitioners building text-to-video tools who might want prompt-mixing ideas for extended sequences. A reader seeking new theory or strong empirical validation will not find it here. I would not recommend sending it for peer review until the experiments and comparisons are added and shown to hold up.

Referee Report

2 major / 1 minor

Summary. The paper claims to solve the problem of generating coherent long-form video sequences from discrete text prompts by proposing a framework that integrates scene and action prompts through dynamics-inspired prompt mixing. The method consists of three components: bidirectional time-weighted latent blending for temporal consistency, dynamics-informed prompt weighting (DIPW) that adaptively balances prompts using CLIP-based alignment, narrative progression, and temporal smoothness, and a semantic action representation for modulating transitions. It asserts that this enables fluid video narratives and that extensive experiments show significant outperformance over baselines without any additional training.

Significance. If the claims are substantiated, this would represent a meaningful advance in text-to-video generation by addressing temporal coherence and scene-action continuity in extended sequences, a major hurdle in the field. The training-free nature of the approach is a notable practical advantage that could facilitate broader adoption in storytelling applications.

major comments (2)

[Abstract] Abstract: The central claim that 'Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos' is not accompanied by any quantitative metrics, baseline descriptions, ablation studies, or experimental details. This absence directly undermines the ability to assess the validity of the outperformance assertion, which is load-bearing for the paper's contribution.
[Abstract] Abstract: The DIPW mechanism is presented as adaptively balancing scene and action prompts based on CLIP-based alignment, narrative progression, and temporal smoothness signals, but no mathematical formulation, algorithm, or specific implementation details are provided. This makes it impossible to evaluate whether the mechanism can reliably enforce consistency as claimed.

minor comments (1)

The abstract uses several technical terms (e.g., 'bidirectional time-weighted latent blending', 'dynamics-informed prompt weighting') without brief explanations, which may reduce accessibility for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these detailed comments on the abstract. We agree that the abstract would benefit from additional substantiation of the experimental claims and a higher-level reference to the DIPW formulation to improve readability and verifiability. We address each point below and will revise the abstract in the next version.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos' is not accompanied by any quantitative metrics, baseline descriptions, ablation studies, or experimental details. This absence directly undermines the ability to assess the validity of the outperformance assertion, which is load-bearing for the paper's contribution.

Authors: We acknowledge that the abstract presents the performance claim at a summary level without metrics or experimental specifics. The full manuscript contains a dedicated Experiments section with quantitative metrics (e.g., temporal consistency scores, CLIP-based alignment measures), baseline comparisons, and ablation studies. To address the concern, we will revise the abstract to include a brief reference to key quantitative outcomes and direct readers to the Experiments section. revision: yes
Referee: [Abstract] Abstract: The DIPW mechanism is presented as adaptively balancing scene and action prompts based on CLIP-based alignment, narrative progression, and temporal smoothness signals, but no mathematical formulation, algorithm, or specific implementation details are provided. This makes it impossible to evaluate whether the mechanism can reliably enforce consistency as claimed.

Authors: We agree the abstract describes DIPW at a conceptual level. The manuscript provides the full mathematical formulation (including the adaptive weighting function combining CLIP similarity, narrative progression, and temporal smoothness terms) and algorithmic details in the Methods section. We will revise the abstract to incorporate a concise reference to the core weighting equation and a pointer to the detailed derivation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical assembly of three components (bidirectional time-weighted latent blending, DIPW mechanism, semantic action representation) whose performance claims rest on experimental outperformance rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations by construction. No equations, uniqueness theorems, ansatzes, or self-referential predictions appear in the provided text; the central claims are externally falsifiable via benchmarks and do not invoke load-bearing self-citations. This is the expected non-finding for a methods paper whose contributions are architectural and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no mathematical formulations, free parameters, axioms, or new entities; review is limited to the high-level description provided.

pith-pipeline@v0.9.0 · 5725 in / 1134 out tokens · 45630 ms · 2026-05-23T00:03:39.787735+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Character-Centered Dialogue Generation from Scene-Level Prompts
cs.CV 2025-05 unverdicted novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Mochi-1 preview.https://huggingface

Genmo AI. Mochi-1 preview.https://huggingface. co/genmo/mochi-1-preview, 2024. 1, 3, 6

work page 2024
[2]

Lumiere: A space-time diffu- sion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 3

work page 2024
[3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

8 Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 8 Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22563–22575, 2023. 3

work page 2023
[5]

Stream- ing video diffusion: Online video editing with diffusion models, 2024

Feng Chen, Zhen Yang, Bohan Zhuang, and Qi Wu. Stream- ing video diffusion: Online video editing with diffusion models, 2024. 1

work page 2024
[6]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 3

work page 2023
[7]

Seine: Short-to-long video diffu- sion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representa- tions, 2023. 3

work page 2023
[8]

Learning temporal coherence via self- supervision for gan-based video generation.ACM Transac- tions on Graphics (TOG), 39(4):75–1, 2020

Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taix ´e, and Nils Thuerey. Learning temporal coherence via self- supervision for gan-based video generation.ACM Transac- tions on Graphics (TOG), 39(4):75–1, 2020. 1

work page 2020
[9]

Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024. 3

work page 2024
[10]

Preserve your own correlation: A noise prior for video diffusion models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22930–22941, 2023

work page 2023
[11]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factoriz- ing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. 3

work page arXiv 2023
[12]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024. 1, 3

work page arXiv 2024
[13]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023. 1, 3

work page arXiv 2023
[15]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024. 3

work page arXiv 2024
[16]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Text2video-zero: Text- to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 3

work page 2023
[19]

Deep video inpainting

Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Deep video inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5792–5801, 2019. 1

work page 2019
[20]

Prompt mixing in diffusion models using the black scholes algorithm.arXiv preprint arXiv:2405.13685, 2024

Divya Kothandaraman, Ming Lin, and Dinesh Manocha. Prompt mixing in diffusion models using the black scholes algorithm.arXiv preprint arXiv:2405.13685, 2024. 6

work page arXiv 2024
[21]

Storygan: A sequential conditional gan for story vi- sualization

Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story vi- sualization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6329–6338,

work page
[22]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

work page arXiv
[23]

Im- proving generation and evaluation of visual stories via se- mantic consistency.arXiv preprint arXiv:2105.10026, 2021

Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Im- proving generation and evaluation of visual stories via se- mantic consistency.arXiv preprint arXiv:2105.10026, 2021. 1

work page arXiv 2021
[24]

Video frame in- terpolation via adaptive separable convolution

Simon Niklaus, Long Mai, and Feng Liu. Video frame in- terpolation via adaptive separable convolution. InProceed- ings of the IEEE international conference on computer vi- sion, pages 261–270, 2017. 1

work page 2017
[25]

Mevg: Multi-event video generation with text-to-video models

Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 1, 3

work page 2024
[26]

Video generation models as world simula- tors.https : / / openai

OpenAI. Video generation models as world simula- tors.https : / / openai . com / index / video - generation - models - as - world - simulators/, 2024

work page 2024
[27]

Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

work page 2025
[28]

Hier- archical spatio-temporal decoupling for text-to-video gener- ation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hier- archical spatio-temporal decoupling for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6635–6645, 2024

work page 2024
[29]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023. 1

work page arXiv 2023
[30]

Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, An- drew Pierson, Ankush Gupta, Austin Waters, A¨aron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nand- wani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt...

work page 2024
[31]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Unsupervised learning of video representations using lstms

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudi- nov. Unsupervised learning of video representations using lstms. InInternational conference on machine learning, pages 843–852. PMLR, 2015. 1

work page 2015
[33]

Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdo- gan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, Jos ´e Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Mi...

work page 2024
[34]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 1

work page arXiv 2023
[36]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024. 3

work page 2024
[38]

Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024. 1, 3, 8

work page arXiv 2024
[39]

Art-v: Auto-regressive text-to- video generation with diffusion models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jian- min Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to- video generation with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024. 3

work page 2024
[40]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023

Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023. 1, 3

work page arXiv 2023
[42]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

work page 2024
[43]

the words ‘KEEP OFF THE GRASS

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023. 3

work page arXiv 2023
[44]

Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024. 1, 3

work page arXiv 2024
[45]

Temporalstory: Enhancing consistency in story visualization using spatial-temporal at- tention.arXiv e-prints, pages arXiv–2407, 2024

Sixiao Zheng and Yanwei Fu. Temporalstory: Enhancing consistency in story visualization using spatial-temporal at- tention.arXiv e-prints, pages arXiv–2407, 2024. 1 10

work page 2024
[46]

Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025. 3

work page 2025
[47]

Full Model

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024. 1, 3, 6 11 A. Supplementary Materials A.1. Ethics Statement Ethics Statement All names, characters, and events appearing in...

work page arXiv 2024

[1] [1]

Mochi-1 preview.https://huggingface

Genmo AI. Mochi-1 preview.https://huggingface. co/genmo/mochi-1-preview, 2024. 1, 3, 6

work page 2024

[2] [2]

Lumiere: A space-time diffu- sion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 3

work page 2024

[3] [3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

8 Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 8 Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22563–22575, 2023. 3

work page 2023

[5] [5]

Stream- ing video diffusion: Online video editing with diffusion models, 2024

Feng Chen, Zhen Yang, Bohan Zhuang, and Qi Wu. Stream- ing video diffusion: Online video editing with diffusion models, 2024. 1

work page 2024

[6] [6]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 3

work page 2023

[7] [7]

Seine: Short-to-long video diffu- sion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representa- tions, 2023. 3

work page 2023

[8] [8]

Learning temporal coherence via self- supervision for gan-based video generation.ACM Transac- tions on Graphics (TOG), 39(4):75–1, 2020

Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taix ´e, and Nils Thuerey. Learning temporal coherence via self- supervision for gan-based video generation.ACM Transac- tions on Graphics (TOG), 39(4):75–1, 2020. 1

work page 2020

[9] [9]

Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024. 3

work page 2024

[10] [10]

Preserve your own correlation: A noise prior for video diffusion models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22930–22941, 2023

work page 2023

[11] [11]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factoriz- ing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. 3

work page arXiv 2023

[12] [12]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024. 1, 3

work page arXiv 2024

[13] [13]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023. 1, 3

work page arXiv 2023

[15] [15]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024. 3

work page arXiv 2024

[16] [16]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Text2video-zero: Text- to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 3

work page 2023

[19] [19]

Deep video inpainting

Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Deep video inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5792–5801, 2019. 1

work page 2019

[20] [20]

Prompt mixing in diffusion models using the black scholes algorithm.arXiv preprint arXiv:2405.13685, 2024

Divya Kothandaraman, Ming Lin, and Dinesh Manocha. Prompt mixing in diffusion models using the black scholes algorithm.arXiv preprint arXiv:2405.13685, 2024. 6

work page arXiv 2024

[21] [21]

Storygan: A sequential conditional gan for story vi- sualization

Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story vi- sualization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6329–6338,

work page

[22] [22]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

work page arXiv

[23] [23]

Im- proving generation and evaluation of visual stories via se- mantic consistency.arXiv preprint arXiv:2105.10026, 2021

Adyasha Maharana, Darryl Hannan, and Mohit Bansal. Im- proving generation and evaluation of visual stories via se- mantic consistency.arXiv preprint arXiv:2105.10026, 2021. 1

work page arXiv 2021

[24] [24]

Video frame in- terpolation via adaptive separable convolution

Simon Niklaus, Long Mai, and Feng Liu. Video frame in- terpolation via adaptive separable convolution. InProceed- ings of the IEEE international conference on computer vi- sion, pages 261–270, 2017. 1

work page 2017

[25] [25]

Mevg: Multi-event video generation with text-to-video models

Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 1, 3

work page 2024

[26] [26]

Video generation models as world simula- tors.https : / / openai

OpenAI. Video generation models as world simula- tors.https : / / openai . com / index / video - generation - models - as - world - simulators/, 2024

work page 2024

[27] [27]

Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

work page 2025

[28] [28]

Hier- archical spatio-temporal decoupling for text-to-video gener- ation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hier- archical spatio-temporal decoupling for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6635–6645, 2024

work page 2024

[29] [29]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023. 1

work page arXiv 2023

[30] [30]

Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, An- drew Pierson, Ankush Gupta, Austin Waters, A¨aron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nand- wani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt...

work page 2024

[31] [31]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

Unsupervised learning of video representations using lstms

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudi- nov. Unsupervised learning of video representations using lstms. InInternational conference on machine learning, pages 843–852. PMLR, 2015. 1

work page 2015

[33] [33]

Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdo- gan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, Jos ´e Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Mi...

work page 2024

[34] [34]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[35] [35]

Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 1

work page arXiv 2023

[36] [36]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024. 3

work page 2024

[38] [38]

Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024. 1, 3, 8

work page arXiv 2024

[39] [39]

Art-v: Auto-regressive text-to- video generation with diffusion models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jian- min Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to- video generation with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024. 3

work page 2024

[40] [40]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023

Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023. 1, 3

work page arXiv 2023

[42] [42]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

work page 2024

[43] [43]

the words ‘KEEP OFF THE GRASS

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023. 3

work page arXiv 2023

[44] [44]

Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024. 1, 3

work page arXiv 2024

[45] [45]

Temporalstory: Enhancing consistency in story visualization using spatial-temporal at- tention.arXiv e-prints, pages arXiv–2407, 2024

Sixiao Zheng and Yanwei Fu. Temporalstory: Enhancing consistency in story visualization using spatial-temporal at- tention.arXiv e-prints, pages arXiv–2407, 2024. 1 10

work page 2024

[46] [46]

Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025. 3

work page 2025

[47] [47]

Full Model

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024. 1, 3, 6 11 A. Supplementary Materials A.1. Ethics Statement Ethics Statement All names, characters, and events appearing in...

work page arXiv 2024