pith. sign in

arxiv: 2505.16819 · v4 · pith:PPYN4PL6new · submitted 2025-05-22 · 💻 cs.CV

Character-Centered Dialogue Generation from Scene-Level Prompts

Pith reviewed 2026-05-22 13:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords dialogue generationscene-based videocharacter consistencymultimodal storytellingrecursive narrative bankvision-language modeltraining-free pipelineexpressive speech synthesis
0
0 comments X

The pith

A modular pipeline generates character-consistent dialogue from scene prompts by grounding an LLM in visual semantics and a recursive narrative memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to add natural spoken dialogue to scene-based video generation without training any new models. It processes paired prompts for each scene's setting and character actions, pulls high-level visual meaning from a generated scene image using an existing vision-language encoder, and passes that plus the prompts to a large language model to create expressive utterances. A speaker-aware Recursive Narrative Bank stores each character's dialogue history across scenes so the language model can keep emotional tone and narrative goals consistent as the story unfolds. The final step converts the text into character-specific speech, yielding fully voiced multimodal videos. This training-free approach aims to make AI storytelling feel more like traditional character-driven narratives.

Core claim

The central claim is that combining a pretrained vision-language encoder's high-level scene semantics with structured setting and behavior prompts, guided by a Recursive Narrative Bank that accumulates speaker-specific history, lets a large language model produce expressive, contextually consistent dialogue that can be rendered as character-conditioned speech, all without fine-tuning or additional supervision.

What carries the argument

The Recursive Narrative Bank: a speaker-aware, temporally structured memory that accumulates each character's dialogue history to maintain evolving goals, social context, and narrative roles across scenes.

If this is right

  • The same pipeline can produce coherent, fully voiced audiovisual stories across diverse settings and story lengths without retraining.
  • Dialogue remains grounded in both the visual content of each scene and the accumulated narrative context.
  • Character expression and emotional tone evolve naturally with the story rather than resetting per scene.
  • The framework scales to new story domains by swapping only the underlying story generation and speech synthesis models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory structure could support real-time interactive stories where user choices update the narrative bank on the fly.
  • Pairing this dialogue module with controllable video generators might reduce visible mismatches between spoken words and on-screen actions.
  • Extending the bank to track non-verbal cues extracted from the vision encoder could further tighten audiovisual alignment.

Load-bearing premise

A pretrained vision-language encoder can extract high-level visual semantics that are sufficient to guide a large language model in producing expressive dialogue consistent with both the current scene prompts and prior narrative context, without any fine-tuning.

What would settle it

Generate videos from a series of scene prompts and check whether the spoken dialogue often fails to match visible actions or character traits visible in the rendered scenes, or whether later scenes contradict earlier dialogue history.

Figures

Figures reproduced from arXiv: 2505.16819 by Ming C. Lin, Taewon Kang.

Figure 1
Figure 1. Figure 1: Storytelling across Thematic Settings. We showcase multimodal storytelling across four distinct settings using animated characters. (a) Urban Exploration in San Francisco (Top, L) Shrek and Donkey reflect on the beauty of the Pacific sunset and the Golden Gate, blending humorous banter with visual splendor. (b) Urban Exploration in Las Vegas (Top, R): Kevin and Bob (Minions) navigate the vibrant nightlife … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our proposed multimodal storytelling framework. Given a sequence of paired prompts (p1, p2),(p3, p4), . . . ,(p2N−1, p2N ), our system generates coherent video scenes, natural dialogue, and expressive speech. A story gener￾ation model (e.g., Text2Story (T2S)) synthesizes short video clips for each prompt pair, which are blended using Time-Weighted Blending (TWB) and refined with Semantic Action… view at source ↗
Figure 3
Figure 3. Figure 3: End-to-End Comparison of Character-Driven Narrative Generation with Speech Rendering. We qualitatively compare three systems—Ours (left), Mochi + Speech Rendering (middle), and Vlogger + Speech Rendering (right)—in generating multimodal dialogue sequences from structured prompts. Each sequence includes sampled frames, generated dialogue, and corresponding audio spectrograms. While all systems receive ident… view at source ↗
Figure 4
Figure 4. Figure 4: Human Subjects Evaluation Results. (Left) Model comparison results across three conditions (Speech, Dialogue, End-to-End). Our model (dark blue) was preferred in the vast majority of cases, outperforming all baselines including Tacotron 2, Bark, BLIP, Mochi, and Vlogger. (Center) Ablation results for Recursive Narrative Bank and Key Frame Conditioning. Our full model again dominates, indicating the importa… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative Ablation Results on Dialogue Coherence. (Full Version) We showcase dialogue generated by our full model compared to ablated variants without the Recursive Narrative Bank (w/o RNB) and without Keyframe Conditioning (w/o KFC). The absence of RNB leads to disrupted narrative flow across scenes, while removing KFC yields contextually irrelevant or less grounded utterances. These results highlight t… view at source ↗
read the original abstract

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a modular, training-free pipeline for generating character-centered dialogue from scene-level prompts in scene-based video generation. It uses a pretrained vision-language encoder to extract high-level visual semantics from a representative scene image, fuses these with action-level and character-behavior prompts to guide an LLM in synthesizing expressive utterances, maintains cross-scene consistency via a new Recursive Narrative Bank memory module that accumulates speaker-aware dialogue history, and renders the output as character-conditioned speech to produce fully voiced multimodal narratives, drawing inspiration from Script Theory.

Significance. If empirically validated, the work could meaningfully advance character-driven audiovisual storytelling by filling a gap in dialogue generation for coherent scene-based video systems. The training-free design leveraging off-the-shelf components and the conceptual contribution of the Recursive Narrative Bank for temporal and emotional coherence across scenes represent potential strengths for scalability and generalization.

major comments (2)
  1. Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.
  2. Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.
minor comments (1)
  1. Abstract: Clarify the exact update mechanism and data structure of the Recursive Narrative Bank (e.g., how speaker awareness and temporal ordering are implemented) to aid reproducibility, perhaps via pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate to strengthen the presentation of the training-free pipeline and the Recursive Narrative Bank.

read point-by-point responses
  1. Referee: [—] Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.

    Authors: We agree that the current manuscript supports its claims primarily through qualitative examples and illustrative cases rather than quantitative metrics or user studies. The work focuses on introducing a modular, training-free framework, with the Recursive Narrative Bank as a conceptual contribution for cross-scene consistency. In the revision, we will qualify the abstract language to reflect the available evidence (e.g., 'illustrated through examples') and add a new section presenting extended qualitative results across multiple story domains along with a small-scale human preference study comparing outputs with and without the memory bank. revision: yes

  2. Referee: [—] Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.

    Authors: The vision-language encoder is used to supply high-level visual semantics that are explicitly fused with the action-level and character-behavior prompts before being passed to the LLM; the LLM is further guided by structured instructions and the accumulating Recursive Narrative Bank to produce grounded utterances. We acknowledge that this reliance on off-the-shelf components is central to the training-free design and will expand the method section with concrete examples of encoder outputs, the exact fusion prompt template, and a discussion of cases where coarse semantics might limit expressiveness, including how the speaker-aware memory helps mitigate drift across scenes. revision: partial

Circularity Check

0 steps flagged

No circularity: modular pipeline uses external pretrained components without internal reductions or self-referential derivations

full rationale

The paper presents a training-free modular pipeline that composes off-the-shelf vision-language encoders, LLMs, and a newly introduced Recursive Narrative Bank to generate dialogue from scene prompts and images. No equations, parameter fittings, or derivations are described that would reduce outputs to inputs by construction. Claims rest on the independent capabilities of pretrained models and the explicit accumulation of dialogue history for cross-scene coherence, with no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is self-contained as a descriptive composition rather than a closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on assumptions about the capabilities of off-the-shelf pretrained models and introduces one new structural component without external validation.

axioms (1)
  • domain assumption Pretrained vision-language encoders extract high-level semantics that are directly useful for guiding dialogue generation in an LLM.
    Invoked in the step where visual semantics are combined with prompts to condition the language model.
invented entities (1)
  • Recursive Narrative Bank no independent evidence
    purpose: Speaker-aware, temporally structured memory that accumulates each character's dialogue history to enforce contextual and emotional consistency across scenes.
    Presented as an original design inspired by Script Theory; no independent evidence or external benchmarks provided in the abstract.

pith-pipeline@v0.9.0 · 5737 in / 1532 out tokens · 42979 ms · 2026-05-22T13:28:44.225359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Mochi-1 preview.https://huggingface

    Genmo AI. Mochi-1 preview.https://huggingface. co/genmo/mochi-1-preview, 2024. 4

  3. [3]

    Bark: Text-prompted generative audio model

    Suno AI. Bark: Text-prompted generative audio model. https://github.com/suno-ai/bark, 2023. Ac- cessed: 2025-05-13. 4

  4. [4]

    Csm-1b: Conversational speech model

    Sesame AI. Csm-1b: Conversational speech model. https : / / huggingface . co / sesame / csm - 1b,

  5. [5]

    4, 6, 18

    Accessed: 2025-05-13. 4, 6, 18

  6. [6]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  7. [7]

    The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024

    Samee Arif, Taimoor Arif, Muhammad Saad Haroon, Aam- ina Jamal Khan, Agha Ali Raza, and Awais Athar. The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024. 1, 4

  8. [8]

    Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023

    Jaeyeon Bae, Seokhoon Jeong, Seokun Kang, Namgi Han, Jae-Yon Lee, Hyounghun Kim, and Taehwan Kim. Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023. 1, 4

  9. [9]

    Lumiere: A space-time diffu- sion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 4

  10. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  11. [11]

    Align your latents: High-resolution video synthesis with la- tent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22563–22575, 2023. 4

  12. [12]

    Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979

    Gordon H Bower, John B Black, and Terrence J Turner. Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979. 3, 16, 18

  13. [13]

    Control-a-video: Controllable text-to-video generation with diffusion models

    Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 4

  14. [14]

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly- scaled multilingual language-image model.arXiv preprint arXiv:2209.06794, 2022. 4

  15. [15]

    Seine: Short-to-long video diffu- sion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representa- tions, 2023. 4

  16. [16]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. InEuropean conference on computer vision, pages 104–120. Springer,

  17. [17]

    Instructblip: Towards general- purpose vision-language models with instruction tuning,

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

  18. [18]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025. 4

  19. [19]

    Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024. 4

  20. [20]

    Violet: End-to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token mod- eling.arXiv preprint arXiv:2111.12681, 2021. 4

  21. [21]

    Preserve your own correlation: A noise prior for video diffusion models

    Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22930–22941, 2023. 4

  22. [22]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factoriz- ing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. 4

  23. [23]

    Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

    Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024. 3

  24. [24]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

  25. [25]

    Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023

    Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023. 3

  26. [26]

    Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

    Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024. 4

  27. [27]

    Visual storytelling

    Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Is- han Misra, Aishwarya Agrawal, Jacob Devlin, Ross Gir- shick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. InProceedings of the 2016 conference of the North American chapter of the association for com- putational linguistics: Human language technologies, pages 1233–1239, 2016. 3

  28. [28]

    Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

    Taewon Kang, Divya Kothandaraman, and Ming C Lin. Text2story: Advancing video storytelling with text guidance. arXiv preprint arXiv:2503.06310, 2025. 1, 4, 5, 6, 7, 19

  29. [29]

    Text2video-zero: Text- to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 4

  30. [30]

    Vilt: Vision- and-language transformer without convolution or region su- pervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 4

  31. [31]

    Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

  32. [32]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 4, 5, 6

  33. [33]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

  34. [34]

    Lavender: Unifying video- language understanding as masked language modeling

    Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video- language understanding as masked language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 23119–23129, 2023. 4

  35. [35]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

  36. [36]

    Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024

    Xiaochuan Lin and Xiangyong Chen. Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024. 1, 4

  37. [37]

    Intelligent grimm-open-ended visual storytelling via latent diffusion models

    Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 3

  38. [38]

    Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023

    Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xi- aoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023. 4

  39. [39]

    Mevg: Multi-event video generation with text-to-video models

    Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 4

  40. [40]

    Video generation models as world simula- tors.https : / / openai

    OpenAI. Video generation models as world simula- tors.https : / / openai . com / index / video - generation - models - as - world - simulators/,

  41. [41]

    Synthesizing coherent story with auto-regressive la- tent diffusion models

    Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 2920–2930, 2024. 3

  42. [42]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 4

  43. [43]

    Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

  44. [44]

    Hier- archical spatio-temporal decoupling for text-to-video gener- ation

    Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hier- archical spatio-temporal decoupling for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6635–6645, 2024

  45. [45]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023. 4

  46. [46]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 4

  47. [47]

    Psychology press, 2013

    Roger C Schank and Robert P Abelson.Scripts, plans, goals, and understanding: An inquiry into human knowledge struc- tures. Psychology press, 2013. 3, 16, 18

  48. [48]

    Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, An- drew Pierson, Ankush Gupta, Austin Waters, A¨aron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nand- wani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt...

  49. [49]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

  50. [50]

    From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024

    Samuel S Sohn, Danrui Li, Sen Zhang, Che-Jui Chang, and Mubbasir Kapadia. From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024. 1, 4

  51. [51]

    Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdo- gan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, Jos ´e Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Mi...

  52. [52]

    Phenaki: Variable Length Video Generation From Open Domain Textual Description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

  53. [53]

    Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 4

  54. [54]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022. 4

  55. [55]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023. 4

  56. [56]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. InInternational conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 4

  57. [57]

    Tacotron: Towards End-to-End Speech Synthesis

    Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end- to-end speech synthesis.arXiv preprint arXiv:1703.10135,

  58. [58]

    Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024. 4

  59. [59]

    Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024

    Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024. 4

  60. [60]

    Art-v: Auto-regressive text-to- video generation with diffusion models

    Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jian- min Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to- video generation with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024. 4

  61. [61]

    Planning and understanding: A computa- tional approach to human reasoning

    Robert Wilensky. Planning and understanding: A computa- tional approach to human reasoning. 1983. 3, 16, 18

  62. [62]

    Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio

    Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, and Mengyue Wu. Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint arXiv:2503.05242, 2025. 1, 4

  63. [63]

    Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024

    Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024. 1, 4

  64. [64]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 4

  65. [65]

    Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023

    Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023. 4

  66. [66]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917, 2022. 4

  67. [67]

    Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024

    Chuanqi Zang, Jiji Tang, Rongsheng Zhang, Zeng Zhao, Tangjie Lv, Mingtao Pei, and Wei Liang. Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024. 1, 4

  68. [68]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024. 4

  69. [69]

    Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024

    Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, and Jun- cong Lin. Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024. 1, 4

  70. [70]

    the words ‘KEEP OFF THE GRASS

    Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023. 4

  71. [71]

    Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024

    Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024. 3

  72. [72]

    Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025

    Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025. 4

  73. [73]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 4

  74. [74]

    Video A”, “Video B

    Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024. 3 A. Supplementary Material A.1. Ethics Statement Ethics Statement All audio samples used in this work are limited to short...