Character-Centered Dialogue Generation from Scene-Level Prompts
Pith reviewed 2026-05-22 13:28 UTC · model grok-4.3
The pith
A modular pipeline generates character-consistent dialogue from scene prompts by grounding an LLM in visual semantics and a recursive narrative memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that combining a pretrained vision-language encoder's high-level scene semantics with structured setting and behavior prompts, guided by a Recursive Narrative Bank that accumulates speaker-specific history, lets a large language model produce expressive, contextually consistent dialogue that can be rendered as character-conditioned speech, all without fine-tuning or additional supervision.
What carries the argument
The Recursive Narrative Bank: a speaker-aware, temporally structured memory that accumulates each character's dialogue history to maintain evolving goals, social context, and narrative roles across scenes.
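The paper as summarized here does not give the bank's data structure or update rule, so the following is only a minimal sketch of one way a speaker-aware, temporally ordered memory could be organized. The names `Utterance`, `NarrativeBank`, `add`, `history`, and `as_context`, and the `limit` window, are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List


@dataclass(frozen=True)
class Utterance:
    scene_index: int  # position of the scene in the story (temporal order)
    speaker: str      # character who speaks the line
    text: str         # the generated dialogue line


@dataclass
class NarrativeBank:
    """Hypothetical speaker-aware, temporally ordered dialogue memory."""
    by_speaker: DefaultDict[str, List[Utterance]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add(self, scene_index: int, speaker: str, text: str) -> None:
        # Accumulate the line under its speaker, then keep that speaker's
        # lines sorted by scene (stable sort keeps within-scene order).
        lines = self.by_speaker[speaker]
        lines.append(Utterance(scene_index, speaker, text))
        lines.sort(key=lambda u: u.scene_index)

    def history(self, speaker: str, before_scene: int, limit: int = 5) -> List[Utterance]:
        # Most recent `limit` lines by this speaker from strictly earlier scenes.
        past = [u for u in self.by_speaker.get(speaker, []) if u.scene_index < before_scene]
        return past[-limit:]

    def as_context(self, speaker: str, before_scene: int, limit: int = 5) -> str:
        # Render the history as plain text that could be placed in an LLM prompt.
        return "\n".join(
            f"[scene {u.scene_index}] {u.speaker}: {u.text}"
            for u in self.history(speaker, before_scene, limit)
        )


bank = NarrativeBank()
bank.add(0, "Mira", "We leave at dawn.")
bank.add(1, "Tom", "Then we swim.")
bank.add(1, "Mira", "The bridge is out.")
print(bank.as_context("Mira", before_scene=2))
```

Keying the store by speaker and filtering by scene index is one simple reading of "speaker-aware" and "temporally structured"; the actual design may track goals or roles in a richer form.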
If this is right
- The same pipeline can produce coherent, fully voiced audiovisual stories across diverse settings and story lengths without retraining.
- Dialogue remains grounded in both the visual content of each scene and the accumulated narrative context.
- Character expression and emotional tone evolve naturally with the story rather than resetting per scene.
- The framework scales to new story domains by swapping only the underlying story generation and speech synthesis models.
Where Pith is reading between the lines
- The same memory structure could support real-time interactive stories where user choices update the narrative bank on the fly.
- Pairing this dialogue module with controllable video generators might reduce visible mismatches between spoken words and on-screen actions.
- Extending the bank to track non-verbal cues extracted from the vision encoder could further tighten audiovisual alignment.
Load-bearing premise
A pretrained vision-language encoder can extract high-level visual semantics that are sufficient to guide a large language model in producing expressive dialogue consistent with both the current scene prompts and prior narrative context, without any fine-tuning.
What would settle it
Generate videos from a series of scene prompts and check whether the spoken dialogue often fails to match the actions or character traits visible in the rendered scenes, or whether later scenes contradict earlier dialogue history.
Original abstract
Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.
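The per-scene flow described in the abstract (a prompt pair plus a scene image, an encoder-derived description, accumulated speaker history, then an LLM call) could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the scene dictionary keys, and the functions `build_dialogue_prompt` and `run_story` are hypothetical, and the vision-language encoder and LLM are passed in as plain callables so that no specific model API is implied. Speech rendering is left out.

```python
from typing import Callable, Dict, List, Tuple


def build_dialogue_prompt(setting: str, behavior: str, caption: str,
                          speaker: str, history: List[str]) -> str:
    # Fuse the two structured prompts, the visual description, and the
    # speaker's earlier lines into one text prompt (hypothetical template).
    history_block = "\n".join(history) if history else "(no prior lines)"
    return (
        f"Setting: {setting}\n"
        f"Character behavior: {behavior}\n"
        f"Visual description: {caption}\n"
        f"Earlier lines by {speaker}:\n{history_block}\n"
        f"Write the next line for {speaker}."
    )


def run_story(scenes: List[Dict[str, str]],
              caption_image: Callable[[str], str],
              generate_text: Callable[[str], str]) -> List[Tuple[str, str]]:
    # `caption_image` stands in for the pretrained vision-language encoder,
    # `generate_text` for the LLM; both are assumptions supplied by the caller.
    history: Dict[str, List[str]] = {}
    lines: List[Tuple[str, str]] = []
    for scene in scenes:
        speaker = scene["speaker"]
        caption = caption_image(scene["image"])
        prompt = build_dialogue_prompt(
            scene["setting"], scene["behavior"], caption,
            speaker, history.get(speaker, []),
        )
        line = generate_text(prompt)
        # Feed the new line back into the memory so later scenes can see it.
        history.setdefault(speaker, []).append(line)
        lines.append((speaker, line))
    return lines
```

Because no model is fine-tuned in this flow, swapping the encoder or the LLM only means passing different callables, which is the sense in which such a pipeline is training-free.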
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a modular, training-free pipeline for generating character-centered dialogue from scene-level prompts in scene-based video generation. It uses a pretrained vision-language encoder to extract high-level visual semantics from a representative scene image, fuses these with action-level and character-behavior prompts to guide an LLM in synthesizing expressive utterances, maintains cross-scene consistency via a new Recursive Narrative Bank memory module that accumulates speaker-aware dialogue history, and renders the output as character-conditioned speech to produce fully voiced multimodal narratives, drawing inspiration from Script Theory.
Significance. If empirically validated, the work could meaningfully advance character-driven audiovisual storytelling by filling a gap in dialogue generation for coherent scene-based video systems. The training-free design leveraging off-the-shelf components and the conceptual contribution of the Recursive Narrative Bank for temporal and emotional coherence across scenes represent potential strengths for scalability and generalization.
major comments (2)
- Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.
- Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.
minor comments (1)
- Abstract: Clarify the exact update mechanism and data structure of the Recursive Narrative Bank (e.g., how speaker awareness and temporal ordering are implemented) to aid reproducibility, perhaps via pseudocode.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate to strengthen the presentation of the training-free pipeline and the Recursive Narrative Bank.
- Referee: Abstract: The central claims that the pipeline produces 'expressive, character-consistent utterances' and 'visually and auditorily grounded dialogue' maintained via the Recursive Narrative Bank are unsupported, as the manuscript provides no quantitative evaluations, ablation studies, user studies, or baseline comparisons to substantiate consistency, expressiveness, or visual grounding.
Authors: We agree that the current manuscript supports its claims primarily through qualitative examples and illustrative cases rather than quantitative metrics or user studies. The work focuses on introducing a modular, training-free framework, with the Recursive Narrative Bank as a conceptual contribution for cross-scene consistency. In the revision, we will qualify the abstract language to reflect the available evidence (e.g., 'illustrated through examples') and add a new section presenting extended qualitative results across multiple story domains along with a small-scale human preference study comparing outputs with and without the memory bank. revision: yes
- Referee: Method description (pipeline overview): The assumption that a pretrained vision-language encoder alone extracts semantics sufficient to guide an LLM toward prompt-grounded and expressive dialogue without fine-tuning or supervision is load-bearing for the training-free claim but remains untested; if the encoder outputs are too coarse, the downstream LLM synthesis cannot compensate as described.
Authors: The vision-language encoder is used to supply high-level visual semantics that are explicitly fused with the action-level and character-behavior prompts before being passed to the LLM; the LLM is further guided by structured instructions and the accumulating Recursive Narrative Bank to produce grounded utterances. We acknowledge that this reliance on off-the-shelf components is central to the training-free design and will expand the method section with concrete examples of encoder outputs, the exact fusion prompt template, and a discussion of cases where coarse semantics might limit expressiveness, including how the speaker-aware memory helps mitigate drift across scenes. revision: partial
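The first response proposes a small-scale human preference study comparing outputs with and without the memory bank. One conventional way to summarize such pairwise judgments is an exact two-sided sign test against a 50/50 null. The sketch below assumes each rater picks one of the two variants per story and that ties are dropped; the function name and the counts in the usage lines are hypothetical, not results from the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test: are `wins` vs `losses` consistent with 50/50?"""
    n = wins + losses
    if n == 0:
        return 1.0  # no informative judgments
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction
    # under a fair coin, then doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only.
with_bank_preferred, without_bank_preferred = 9, 1
print(sign_test_p_value(with_bank_preferred, without_bank_preferred))
```

With only a handful of raters the test has little power, so a preference study of this size would be suggestive rather than conclusive.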
Circularity Check
No circularity: modular pipeline uses external pretrained components without internal reductions or self-referential derivations
The paper presents a training-free modular pipeline that composes off-the-shelf vision-language encoders, LLMs, and a newly introduced Recursive Narrative Bank to generate dialogue from scene prompts and images. No equations, parameter fittings, or derivations are described that would reduce outputs to inputs by construction. Claims rest on the independent capabilities of pretrained models and the explicit accumulation of dialogue history for cross-scene coherence, with no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is self-contained as a descriptive composition rather than a closed mathematical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained vision-language encoders extract high-level semantics that are directly useful for guiding dialogue generation in an LLM.
invented entities (1)
- Recursive Narrative Bank: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis... Recursive Narrative Bank... reference-driven voice synthesis"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our training-free framework generalizes across diverse story settings"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.