Character-Centered Dialogue Generation from Scene-Level Prompts

Ming C. Lin; Taewon Kang

arxiv: 2505.16819 · v4 · pith:PPYN4PL6new · submitted 2025-05-22 · 💻 cs.CV

Character-Centered Dialogue Generation from Scene-Level Prompts

Taewon Kang , Ming C. Lin This is my paper

Pith reviewed 2026-05-22 13:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords dialogue generationscene-based videocharacter consistencymultimodal storytellingrecursive narrative bankvision-language modeltraining-free pipelineexpressive speech synthesis

0 comments

The pith

A modular pipeline generates character-consistent dialogue from scene prompts by grounding an LLM in visual semantics and a recursive narrative memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to add natural spoken dialogue to scene-based video generation without training any new models. It processes paired prompts for each scene's setting and character actions, pulls high-level visual meaning from a generated scene image using an existing vision-language encoder, and passes that plus the prompts to a large language model to create expressive utterances. A speaker-aware Recursive Narrative Bank stores each character's dialogue history across scenes so the language model can keep emotional tone and narrative goals consistent as the story unfolds. The final step converts the text into character-specific speech, yielding fully voiced multimodal videos. This training-free approach aims to make AI storytelling feel more like traditional character-driven narratives.

Core claim

The central claim is that combining a pretrained vision-language encoder's high-level scene semantics with structured setting and behavior prompts, guided by a Recursive Narrative Bank that accumulates speaker-specific history, lets a large language model produce expressive, contextually consistent dialogue that can be rendered as character-conditioned speech, all without fine-tuning or additional supervision.

What carries the argument

The Recursive Narrative Bank: a speaker-aware, temporally structured memory that accumulates each character's dialogue history to maintain evolving goals, social context, and narrative roles across scenes.

If this is right

The same pipeline can produce coherent, fully voiced audiovisual stories across diverse settings and story lengths without retraining.
Dialogue remains grounded in both the visual content of each scene and the accumulated narrative context.
Character expression and emotional tone evolve naturally with the story rather than resetting per scene.
The framework scales to new story domains by swapping only the underlying story generation and speech synthesis models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory structure could support real-time interactive stories where user choices update the narrative bank on the fly.
Pairing this dialogue module with controllable video generators might reduce visible mismatches between spoken words and on-screen actions.
Extending the bank to track non-verbal cues extracted from the vision encoder could further tighten audiovisual alignment.

Load-bearing premise

A pretrained vision-language encoder can extract high-level visual semantics that are sufficient to guide a large language model in producing expressive dialogue consistent with both the current scene prompts and prior narrative context, without any fine-tuning.

What would settle it

Generate videos from a series of scene prompts and check whether the spoken dialogue often fails to match visible actions or character traits visible in the rendered scenes, or whether later scenes contradict earlier dialogue history.

Figures

Figures reproduced from arXiv: 2505.16819 by Ming C. Lin, Taewon Kang.

**Figure 1.** Figure 1: Storytelling across Thematic Settings. We showcase multimodal storytelling across four distinct settings using animated characters. (a) Urban Exploration in San Francisco (Top, L) Shrek and Donkey reflect on the beauty of the Pacific sunset and the Golden Gate, blending humorous banter with visual splendor. (b) Urban Exploration in Las Vegas (Top, R): Kevin and Bob (Minions) navigate the vibrant nightlife … view at source ↗

**Figure 2.** Figure 2: Overview of our proposed multimodal storytelling framework. Given a sequence of paired prompts (p1, p2),(p3, p4), . . . ,(p2N−1, p2N ), our system generates coherent video scenes, natural dialogue, and expressive speech. A story generation model (e.g., Text2Story (T2S)) synthesizes short video clips for each prompt pair, which are blended using Time-Weighted Blending (TWB) and refined with Semantic Action… view at source ↗

**Figure 3.** Figure 3: End-to-End Comparison of Character-Driven Narrative Generation with Speech Rendering. We qualitatively compare three systems—Ours (left), Mochi + Speech Rendering (middle), and Vlogger + Speech Rendering (right)—in generating multimodal dialogue sequences from structured prompts. Each sequence includes sampled frames, generated dialogue, and corresponding audio spectrograms. While all systems receive ident… view at source ↗

**Figure 4.** Figure 4: Human Subjects Evaluation Results. (Left) Model comparison results across three conditions (Speech, Dialogue, End-to-End). Our model (dark blue) was preferred in the vast majority of cases, outperforming all baselines including Tacotron 2, Bark, BLIP, Mochi, and Vlogger. (Center) Ablation results for Recursive Narrative Bank and Key Frame Conditioning. Our full model again dominates, indicating the importa… view at source ↗

**Figure 5.** Figure 5: Qualitative Ablation Results on Dialogue Coherence. (Full Version) We showcase dialogue generated by our full model compared to ablated variants without the Recursive Narrative Bank (w/o RNB) and without Keyframe Conditioning (w/o KFC). The absence of RNB leads to disrupted narrative flow across scenes, while removing KFC yields contextually irrelevant or less grounded utterances. These results highlight t… view at source ↗

read the original abstract

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a training-free pipeline description that assembles VLMs, LLMs, and a speaker-aware memory bank for character dialogue in scene video generation, but it offers no evaluations or results to show whether the pieces actually work together.

read the letter

The core of this paper is a modular pipeline that takes scene and character prompts, pulls visual semantics from a pretrained vision-language model, feeds them to an LLM for dialogue, and uses a Recursive Narrative Bank to track each speaker's history across scenes for consistency. It then turns the lines into voiced speech. The Recursive Narrative Bank is the main new piece; it draws from script theory to keep emotional and narrative context as the story moves from scene to scene. The rest is off-the-shelf components wired together without any training, which keeps the method simple and generalizable on paper. That assembly is useful to see laid out clearly for people building storytelling tools. The main weakness is the complete lack of evidence. The abstract and description give no numbers, no ablations, no user studies, and no baseline comparisons on how consistent or expressive the generated dialogue actually is. The central bet—that a frozen VLM will extract semantics fine-grained enough to drive good, prompt-grounded lines without fine-tuning—remains untested here. If the visual features are too coarse, the LLM will produce generic or mismatched speech no matter how clever the memory bank is. The stress-test note on that point holds up based on what is shown. This work is aimed at researchers and engineers working on multimodal video pipelines in entertainment or interactive media. Someone looking for a concrete starting point on adding dialogue to scene generators could pull ideas from the architecture. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee who can ask for experiments and tighter validation rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a modular, training-free pipeline for generating character-centered dialogue from scene-level prompts in scene-based video generation. It uses a pretrained vision-language encoder to extract high-level visual semantics from a representative scene image, fuses these with action-level and character-behavior prompts to guide an LLM in synthesizing expressive utterances, maintains cross-scene consistency via a new Recursive Narrative Bank memory module that accumulates speaker-aware dialogue history, and renders the output as character-conditioned speech to produce fully voiced multimodal narratives, drawing inspiration from Script Theory.

Significance. If empirically validated, the work could meaningfully advance character-driven audiovisual storytelling by filling a gap in dialogue generation for coherent scene-based video systems. The training-free design leveraging off-the-shelf components and the conceptual contribution of the Recursive Narrative Bank for temporal and emotional coherence across scenes represent potential strengths for scalability and generalization.

major comments (2)

Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.
Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.

minor comments (1)

Abstract: Clarify the exact update mechanism and data structure of the Recursive Narrative Bank (e.g., how speaker awareness and temporal ordering are implemented) to aid reproducibility, perhaps via pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate to strengthen the presentation of the training-free pipeline and the Recursive Narrative Bank.

read point-by-point responses

Referee: [—] Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.

Authors: We agree that the current manuscript supports its claims primarily through qualitative examples and illustrative cases rather than quantitative metrics or user studies. The work focuses on introducing a modular, training-free framework, with the Recursive Narrative Bank as a conceptual contribution for cross-scene consistency. In the revision, we will qualify the abstract language to reflect the available evidence (e.g., 'illustrated through examples') and add a new section presenting extended qualitative results across multiple story domains along with a small-scale human preference study comparing outputs with and without the memory bank. revision: yes
Referee: [—] Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.

Authors: The vision-language encoder is used to supply high-level visual semantics that are explicitly fused with the action-level and character-behavior prompts before being passed to the LLM; the LLM is further guided by structured instructions and the accumulating Recursive Narrative Bank to produce grounded utterances. We acknowledge that this reliance on off-the-shelf components is central to the training-free design and will expand the method section with concrete examples of encoder outputs, the exact fusion prompt template, and a discussion of cases where coarse semantics might limit expressiveness, including how the speaker-aware memory helps mitigate drift across scenes. revision: partial

Circularity Check

0 steps flagged

No circularity: modular pipeline uses external pretrained components without internal reductions or self-referential derivations

full rationale

The paper presents a training-free modular pipeline that composes off-the-shelf vision-language encoders, LLMs, and a newly introduced Recursive Narrative Bank to generate dialogue from scene prompts and images. No equations, parameter fittings, or derivations are described that would reduce outputs to inputs by construction. Claims rest on the independent capabilities of pretrained models and the explicit accumulation of dialogue history for cross-scene coherence, with no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is self-contained as a descriptive composition rather than a closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on assumptions about the capabilities of off-the-shelf pretrained models and introduces one new structural component without external validation.

axioms (1)

domain assumption Pretrained vision-language encoders extract high-level semantics that are directly useful for guiding dialogue generation in an LLM.
Invoked in the step where visual semantics are combined with prompts to condition the language model.

invented entities (1)

Recursive Narrative Bank no independent evidence
purpose: Speaker-aware, temporally structured memory that accumulates each character's dialogue history to enforce contextual and emotional consistency across scenes.
Presented as an original design inspired by Script Theory; no independent evidence or external benchmarks provided in the abstract.

pith-pipeline@v0.9.0 · 5737 in / 1532 out tokens · 42979 ms · 2026-05-22T13:28:44.225359+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis... Recursive Narrative Bank... reference-driven voice synthesis
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our training-free framework generalizes across diverse story settings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 14 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Mochi-1 preview.https://huggingface

Genmo AI. Mochi-1 preview.https://huggingface. co/genmo/mochi-1-preview, 2024. 4

work page 2024
[3]

Bark: Text-prompted generative audio model

Suno AI. Bark: Text-prompted generative audio model. https://github.com/suno-ai/bark, 2023. Ac- cessed: 2025-05-13. 4

work page 2023
[4]

Csm-1b: Conversational speech model

Sesame AI. Csm-1b: Conversational speech model. https : / / huggingface . co / sesame / csm - 1b,

work page
[5]

4, 6, 18

Accessed: 2025-05-13. 4, 6, 18

work page 2025
[6]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page
[7]

The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024

Samee Arif, Taimoor Arif, Muhammad Saad Haroon, Aam- ina Jamal Khan, Agha Ali Raza, and Awais Athar. The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024. 1, 4

work page arXiv 2024
[8]

Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023

Jaeyeon Bae, Seokhoon Jeong, Seokun Kang, Namgi Han, Jae-Yon Lee, Hyounghun Kim, and Taehwan Kim. Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023. 1, 4

work page arXiv 2023
[9]

Lumiere: A space-time diffu- sion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 4

work page 2024
[10]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22563–22575, 2023. 4

work page 2023
[12]

Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979

Gordon H Bower, John B Black, and Terrence J Turner. Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979. 3, 16, 18

work page 1979
[13]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 4

work page 2023
[14]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly- scaled multilingual language-image model.arXiv preprint arXiv:2209.06794, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Seine: Short-to-long video diffu- sion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representa- tions, 2023. 4

work page 2023
[16]

Uniter: Universal image-text representation learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. InEuropean conference on computer vision, pages 104–120. Springer,

work page
[17]

Instructblip: Towards general- purpose vision-language models with instruction tuning,

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

work page
[18]

One-minute video generation with test-time training

Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025. 4

work page 2025
[19]

Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024. 4

work page 2024
[20]

Violet: End-to-end video-language transformers with masked visual-token modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token mod- eling.arXiv preprint arXiv:2111.12681, 2021. 4

work page arXiv 2021
[21]

Preserve your own correlation: A noise prior for video diffusion models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22930–22941, 2023. 4

work page 2023
[22]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factoriz- ing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. 4

work page arXiv 2023
[23]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024. 3

work page arXiv 2024
[24]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023. 3

work page arXiv 2023
[26]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024. 4

work page arXiv 2024
[27]

Visual storytelling

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Is- han Misra, Aishwarya Agrawal, Jacob Devlin, Ross Gir- shick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. InProceedings of the 2016 conference of the North American chapter of the association for com- putational linguistics: Human language technologies, pages 1233–1239, 2016. 3

work page 2016
[28]

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Taewon Kang, Divya Kothandaraman, and Ming C Lin. Text2story: Advancing video storytelling with text guidance. arXiv preprint arXiv:2503.06310, 2025. 1, 4, 5, 6, 7, 19

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Text2video-zero: Text- to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 4

work page 2023
[30]

Vilt: Vision- and-language transformer without convolution or region su- pervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 4

work page 2021
[31]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

work page 2021
[32]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 4, 5, 6

work page 2022
[33]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

work page 2023
[34]

Lavender: Unifying video- language understanding as masked language modeling

Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video- language understanding as masked language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 23119–23129, 2023. 4

work page 2023
[35]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

work page arXiv
[36]

Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024

Xiaochuan Lin and Xiangyong Chen. Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024. 1, 4

work page arXiv 2024
[37]

Intelligent grimm-open-ended visual storytelling via latent diffusion models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 3

work page 2024
[38]

Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xi- aoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023. 4

work page 2023
[39]

Mevg: Multi-event video generation with text-to-video models

Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 4

work page 2024
[40]

Video generation models as world simula- tors.https : / / openai

OpenAI. Video generation models as world simula- tors.https : / / openai . com / index / video - generation - models - as - world - simulators/,

work page
[41]

Synthesizing coherent story with auto-regressive la- tent diffusion models

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 2920–2930, 2024. 3

work page 2024
[42]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

work page 2025
[44]

Hier- archical spatio-temporal decoupling for text-to-video gener- ation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hier- archical spatio-temporal decoupling for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6635–6645, 2024

work page 2024
[45]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023. 4

work page arXiv 2023
[46]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 4

work page 2021
[47]

Psychology press, 2013

Roger C Schank and Robert P Abelson.Scripts, plans, goals, and understanding: An inquiry into human knowledge struc- tures. Psychology press, 2013. 3, 16, 18

work page 2013
[48]

Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, An- drew Pierson, Ankush Gupta, Austin Waters, A¨aron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nand- wani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt...

work page 2024
[49]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

work page internal anchor Pith review Pith/arXiv arXiv
[50]

From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024

Samuel S Sohn, Danrui Li, Sen Zhang, Che-Jui Chang, and Mubbasir Kapadia. From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024. 1, 4

work page arXiv 2024
[51]

Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdo- gan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, Jos ´e Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Mi...

work page 2024
[52]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[53]

Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 4

work page arXiv 2023
[54]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. InInternational conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 4

work page 2022
[57]

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end- to-end speech synthesis.arXiv preprint arXiv:1703.10135,

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024. 4

work page 2024
[59]

Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024. 4

work page arXiv 2024
[60]

Art-v: Auto-regressive text-to- video generation with diffusion models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jian- min Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to- video generation with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024. 4

work page 2024
[61]

Planning and understanding: A computa- tional approach to human reasoning

Robert Wilensky. Planning and understanding: A computa- tional approach to human reasoning. 1983. 3, 16, 18

work page 1983
[62]

Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio

Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, and Mengyue Wu. Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint arXiv:2503.05242, 2025. 1, 4

work page arXiv 2025
[63]

Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024. 1, 4

work page arXiv 2024
[64]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023

Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023. 4

work page arXiv 2023
[66]

CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[67]

Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024

Chuanqi Zang, Jiji Tang, Rongsheng Zhang, Zeng Zhao, Tangjie Lv, Mingtao Pei, and Wei Liang. Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024. 1, 4

work page arXiv 2024
[68]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024. 4

work page 2024
[69]

Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024

Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, and Jun- cong Lin. Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024. 1, 4

work page arXiv 2024
[70]

the words ‘KEEP OFF THE GRASS

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023. 4

work page arXiv 2023
[71]

Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024. 3

work page arXiv 2024
[72]

Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025. 4

work page 2025
[73]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[74]

Video A”, “Video B

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024. 3 A. Supplementary Material A.1. Ethics Statement Ethics Statement All audio samples used in this work are limited to short...

work page 2024

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Mochi-1 preview.https://huggingface

Genmo AI. Mochi-1 preview.https://huggingface. co/genmo/mochi-1-preview, 2024. 4

work page 2024

[3] [3]

Bark: Text-prompted generative audio model

Suno AI. Bark: Text-prompted generative audio model. https://github.com/suno-ai/bark, 2023. Ac- cessed: 2025-05-13. 4

work page 2023

[4] [4]

Csm-1b: Conversational speech model

Sesame AI. Csm-1b: Conversational speech model. https : / / huggingface . co / sesame / csm - 1b,

work page

[5] [5]

4, 6, 18

Accessed: 2025-05-13. 4, 6, 18

work page 2025

[6] [6]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

work page

[7] [7]

The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024

Samee Arif, Taimoor Arif, Muhammad Saad Haroon, Aam- ina Jamal Khan, Agha Ali Raza, and Awais Athar. The art of storytelling: Multi-agent generative ai for dynamic mul- timodal narratives.arXiv preprint arXiv:2409.11261, 2024. 1, 4

work page arXiv 2024

[8] [8]

Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023

Jaeyeon Bae, Seokhoon Jeong, Seokun Kang, Namgi Han, Jae-Yon Lee, Hyounghun Kim, and Taehwan Kim. Sound of story: Multi-modal storytelling with audio.arXiv preprint arXiv:2310.19264, 2023. 1, 4

work page arXiv 2023

[9] [9]

Lumiere: A space-time diffu- sion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 4

work page 2024

[10] [10]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Align your latents: High-resolution video synthesis with la- tent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 22563–22575, 2023. 4

work page 2023

[12] [12]

Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979

Gordon H Bower, John B Black, and Terrence J Turner. Scripts in memory for text.Cognitive psychology, 11(2): 177–220, 1979. 3, 16, 18

work page 1979

[13] [13]

Control-a-video: Controllable text-to-video generation with diffusion models

Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv e-prints, pages arXiv–2305, 2023. 4

work page 2023

[14] [14]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly- scaled multilingual language-image model.arXiv preprint arXiv:2209.06794, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[15] [15]

Seine: Short-to-long video diffu- sion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffu- sion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representa- tions, 2023. 4

work page 2023

[16] [16]

Uniter: Universal image-text representation learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. InEuropean conference on computer vision, pages 104–120. Springer,

work page

[17] [17]

Instructblip: Towards general- purpose vision-language models with instruction tuning,

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

work page

[18] [18]

One-minute video generation with test-time training

Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025. 4

work page 2025

[19] [19]

Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms

Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024. 4

work page 2024

[20] [20]

Violet: End-to-end video-language transformers with masked visual-token modeling

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token mod- eling.arXiv preprint arXiv:2111.12681, 2021. 4

work page arXiv 2021

[21] [21]

Preserve your own correlation: A noise prior for video diffusion models

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming- Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22930–22941, 2023. 4

work page 2023

[22] [22]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Du- val, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factoriz- ing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. 4

work page arXiv 2023

[23] [23]

Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualiza- tion by llm-guided multi-subject consistent diffusion.arXiv preprint arXiv:2407.12899, 2024. 3

work page arXiv 2024

[24] [24]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation.arXiv preprint arXiv:2307.06940, 2023. 3

work page arXiv 2023

[26] [26]

Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024

Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Con- sistent, dynamic, and extendable long video generation from text.arXiv preprint arXiv:2403.14773, 2024. 4

work page arXiv 2024

[27] [27]

Visual storytelling

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Is- han Misra, Aishwarya Agrawal, Jacob Devlin, Ross Gir- shick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. InProceedings of the 2016 conference of the North American chapter of the association for com- putational linguistics: Human language technologies, pages 1233–1239, 2016. 3

work page 2016

[28] [28]

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Taewon Kang, Divya Kothandaraman, and Ming C Lin. Text2story: Advancing video storytelling with text guidance. arXiv preprint arXiv:2503.06310, 2025. 1, 4, 5, 6, 7, 19

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Text2video-zero: Text- to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 4

work page 2023

[30] [30]

Vilt: Vision- and-language transformer without convolution or region su- pervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 4

work page 2021

[31] [31]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

work page 2021

[32] [32]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 4, 5, 6

work page 2022

[33] [33]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

work page 2023

[34] [34]

Lavender: Unifying video- language understanding as masked language modeling

Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video- language understanding as masked language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 23119–23129, 2023. 4

work page 2023

[35] [35]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning.arXiv preprint arXiv:2309.15091,

work page arXiv

[36] [36]

Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024

Xiaochuan Lin and Xiangyong Chen. Improving visual sto- rytelling with multimodal large language models.arXiv preprint arXiv:2407.02586, 2024. 1, 4

work page arXiv 2024

[37] [37]

Intelligent grimm-open-ended visual storytelling via latent diffusion models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yan- feng Wang, and Weidi Xie. Intelligent grimm-open-ended visual storytelling via latent diffusion models. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6190–6200, 2024. 3

work page 2024

[38] [38]

Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xi- aoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language mod- els.Advances in Neural Information Processing Systems, 36:29615–29627, 2023. 4

work page 2023

[39] [39]

Mevg: Multi-event video generation with text-to-video models

Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 4

work page 2024

[40] [40]

Video generation models as world simula- tors.https : / / openai

OpenAI. Video generation models as world simula- tors.https : / / openai . com / index / video - generation - models - as - world - simulators/,

work page

[41] [41]

Synthesizing coherent story with auto-regressive la- tent diffusion models

Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, and Wenhu Chen. Synthesizing coherent story with auto-regressive la- tent diffusion models. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 2920–2930, 2024. 3

work page 2024

[42] [42]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petro- vic, and Yuming Du

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sin...

work page 2025

[44] [44]

Hier- archical spatio-temporal decoupling for text-to-video gener- ation

Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hier- archical spatio-temporal decoupling for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 6635–6645, 2024

work page 2024

[45] [45]

Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023

Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169, 2023. 4

work page arXiv 2023

[46] [46]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 4

work page 2021

[47] [47]

Psychology press, 2013

Roger C Schank and Robert P Abelson.Scripts, plans, goals, and understanding: An inquiry into human knowledge struc- tures. Psychology press, 2013. 3, 16, 18

work page 2013

[48] [48]

Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, An- drew Pierson, Ankush Gupta, Austin Waters, A¨aron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nand- wani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt...

work page 2024

[49] [49]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024

Samuel S Sohn, Danrui Li, Sen Zhang, Che-Jui Chang, and Mubbasir Kapadia. From words to worlds: Trans- forming one-line prompt into immersive multi-modal dig- ital stories with communicative llm agent.arXiv preprint arXiv:2406.10478, 2024. 1, 4

work page arXiv 2024

[51] [51]

Veo-Team, :, Agrim Gupta, Ali Razavi, Andeep Toor, Ankush Gupta, Dumitru Erhan, Eleni Shaw, Eric Lau, Frank Belletti, Gabe Barth-Maron, Gregory Shaw, Hakan Erdo- gan, Hakim Sidahmed, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jeff Donahue, Jos ´e Lezama, Kory Mathewson, Kurtis David, Matthieu Kim Lorrain, Marc van Zee, Medhini Narasimhan, Mi...

work page 2024

[52] [52]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kin- dermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [53]

Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 4

work page arXiv 2023

[54] [54]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[55] [55]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. InInternational conference on machine learn- ing, pages 23318–23340. PMLR, 2022. 4

work page 2022

[57] [57]

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end- to-end speech synthesis.arXiv preprint arXiv:1703.10135,

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, pages 1–20, 2024. 4

work page 2024

[59] [59]

Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained storytelling video gen- eration with retrieval-augmented motion adaptation.arXiv preprint arXiv:2411.16657, 2024. 4

work page arXiv 2024

[60] [60]

Art-v: Auto-regressive text-to- video generation with diffusion models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jian- min Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to- video generation with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024. 4

work page 2024

[61] [61]

Planning and understanding: A computa- tional approach to human reasoning

Robert Wilensky. Planning and understanding: A computa- tional approach to human reasoning. 1983. 3, 16, 18

work page 1983

[62] [62]

Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio

Xuenan Xu, Jiahao Mei, Chenliang Li, Yuning Wu, Ming Yan, Shaopeng Lai, Ji Zhang, and Mengyue Wu. Mm- storyagent: Immersive narrated storybook video generation with a multi-agent paradigm across text, image and audio. arXiv preprint arXiv:2503.05242, 2025. 1, 4

work page arXiv 2025

[63] [63]

Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model.arXiv preprint arXiv:2407.08683, 2024. 1, 4

work page arXiv 2024

[64] [64]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023

Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. Nuwa-xl: Diffusion over diffusion for extremely long video generation.arXiv preprint arXiv:2303.12346, 2023. 4

work page arXiv 2023

[66] [66]

CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[67] [67]

Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024

Chuanqi Zang, Jiji Tang, Rongsheng Zhang, Zeng Zhao, Tangjie Lv, Mingtao Pei, and Wei Liang. Let storytelling tell vivid stories: An expressive and fluent multimodal story- teller.arXiv preprint arXiv:2403.07301, 2024. 1, 4

work page arXiv 2024

[68] [68]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Com- puter Vision, pages 1–15, 2024. 4

work page 2024

[69] [69]

Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024

Min Zhang, Zilin Wang, Liyan Chen, Kunhong Liu, and Jun- cong Lin. Dialogue director: Bridging the gap in dialogue visualization for multimodal storytelling.arXiv preprint arXiv:2412.20725, 2024. 1, 4

work page arXiv 2024

[70] [70]

the words ‘KEEP OFF THE GRASS

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation.arXiv preprint arXiv:2305.13077, 2023. 4

work page arXiv 2023

[71] [71]

Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024

Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, and Chunhua Shen. Moviedreamer: Hierarchical generation for coherent long vi- sual sequence.arXiv preprint arXiv:2407.16655, 2024. 3

work page arXiv 2024

[72] [72]

Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025

Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self- attention for long-range image and video generation.Ad- vances in Neural Information Processing Systems, 37: 110315–110340, 2025. 4

work page 2025

[73] [73]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[74] [74]

Video A”, “Video B

Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang. Vlogger: Make your dream a vlog. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8806–8817, 2024. 3 A. Supplementary Material A.1. Ethics Statement Ethics Statement All audio samples used in this work are limited to short...

work page 2024