DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
Pith reviewed 2026-05-25 02:38 UTC · model grok-4.3
The pith
DrawVideo turns storyboard keyframe sketches into coherent long videos by generating shots independently with sketch, appearance, and motion controls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch that sets pose and layout, an appearance prompt that fixes identity and scene, and a motion prompt that directs temporal changes. The method follows a global multi-shot, local single-sketch strategy that first produces a structure-aligned reference keyframe, expands the motion prompt into derivative keyframes, and then generates the intervening clips to assemble each shot. The paper reports that experiments on the newly introduced SketchLongVideo dataset demonstrate structural controllability, appearance consistency, visual stability, and coherent long-video output.
What carries the argument
The hierarchical global multi-shot, local single-sketch strategy that builds each shot from a sketch-defined reference keyframe plus expanded motion keyframes before synthesizing clips between them.
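As a rough illustration of this decomposition, the sketch below models a shot as a sketch plus two prompts and walks the three stages: reference keyframe, derivative keyframes, and clips between adjacent keyframes. Every name here (`Shot`, `generate_reference_keyframe`, `expand_motion_prompt`, `synthesize_clip`) is a hypothetical placeholder, the generators are string-returning stubs rather than the paper's models, and treating the reference keyframe as the first keyframe of the shot is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shot:
    """One storyboard shot (hypothetical container, not the paper's data format)."""
    sketch_path: str        # black-and-white sketch: pose and layout
    appearance_prompt: str  # identity, scene, style
    motion_prompt: str      # temporal dynamics


def generate_reference_keyframe(shot: Shot) -> str:
    # Stub standing in for sketch-conditioned image generation.
    return f"ref({shot.sketch_path}|{shot.appearance_prompt})"


def expand_motion_prompt(reference: str, motion_prompt: str) -> List[str]:
    # Stub: assumes action states are separated by ';' in the motion prompt.
    states = [s.strip() for s in motion_prompt.split(";") if s.strip()]
    # Assumption: the reference keyframe is the first keyframe of the shot.
    return [reference] + [f"key({reference}->{s})" for s in states]


def synthesize_clip(start: str, end: str) -> str:
    # Stub standing in for first/last-frame conditioned video generation.
    return f"clip[{start} => {end}]"


def generate_shot(shot: Shot) -> List[str]:
    """Local single-sketch stage: one clip per pair of adjacent keyframes."""
    reference = generate_reference_keyframe(shot)
    keyframes = expand_motion_prompt(reference, shot.motion_prompt)
    return [synthesize_clip(a, b) for a, b in zip(keyframes, keyframes[1:])]


def generate_video(storyboard: List[Shot]) -> List[List[str]]:
    """Global multi-shot stage: shots are generated independently, in order."""
    return [generate_shot(shot) for shot in storyboard]
```

Note that `generate_video` passes nothing between shots, which is exactly the property the load-bearing premise depends on.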
If this is right
- Sketches give direct control over pose, composition, and layout for each shot instead of relying on text alone.
- Appearance prompts maintain consistent character and scene identity across independently generated shots.
- Motion prompts allow targeted guidance of action states within each shot without overloading a single prompt.
- The approach supports longer videos by avoiding the limits of single-prompt text-to-video generation.
Where Pith is reading between the lines
- The method could be adapted for iterative storyboard refinement where users adjust sketches and regenerate only affected shots.
- It points toward direct pipelines from traditional animation pre-production storyboards into final video output.
- Independent shot generation may allow parallel computation or editing of individual segments in production workflows.
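The first and third points can be made concrete with a small sketch: cache rendered shots by a hash of their controls, re-render only the shots whose controls changed, and render the pending shots in parallel. This is a speculative illustration, not something the paper describes; `render_storyboard`, `shot_key`, and the `render` callback are hypothetical names, and a thread pool is used only because it is the simplest standard-library option.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# (sketch identifier, appearance prompt, motion prompt); a hypothetical shape.
ShotSpec = Tuple[str, str, str]


def shot_key(spec: ShotSpec) -> str:
    """Stable cache key derived from all three controls of a shot."""
    return hashlib.sha256("\x1f".join(spec).encode("utf-8")).hexdigest()


def render_storyboard(
    specs: List[ShotSpec],
    render: Callable[[ShotSpec], str],
    cache: Dict[str, str],
    max_workers: int = 4,
) -> List[str]:
    """Render only shots missing from the cache, in parallel, and keep shot order."""
    keys = [shot_key(spec) for spec in specs]
    pending = [(k, s) for k, s in dict(zip(keys, specs)).items() if k not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `pending`.
        outputs = list(pool.map(render, [spec for _, spec in pending]))
    for (key, _), output in zip(pending, outputs):
        cache[key] = output
    return [cache[key] for key in keys]
```

Editing one sketch changes only that shot's key, so a second call with the same cache re-renders a single shot.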
Load-bearing premise
Generating shots independently through the hierarchical sketch-plus-prompt process will produce overall narrative coherence without extra cross-shot consistency mechanisms.
What would settle it
Generate a multi-shot video from a storyboard sequence and inspect whether character identities, scene elements, or action continuity break at shot boundaries; visible mismatches would show the claim does not hold.
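One way to make this inspection semi-automatic is to compare an identity embedding of the recurring character on either side of each cut. The sketch below assumes such embeddings already exist (for example from an image encoder applied to a character crop); the embedding source, the function names, and the 0.8 threshold are all illustrative assumptions rather than anything specified by the paper.

```python
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def boundary_breaks(
    shots: List[Tuple[Embedding, Embedding]],
    threshold: float = 0.8,  # arbitrary illustrative cutoff
) -> List[Tuple[int, float]]:
    """Flag shot boundaries where identity similarity drops below the threshold.

    Each shot is (embedding of its first frame, embedding of its last frame).
    Boundary i sits between shot i and shot i + 1.
    """
    breaks = []
    for i in range(len(shots) - 1):
        similarity = cosine_similarity(shots[i][1], shots[i + 1][0])
        if similarity < threshold:
            breaks.append((i, similarity))
    return breaks
```

A flagged boundary is only a prompt for visual inspection; a legitimate change of scene or camera angle can also lower the score.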
Original abstract
Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing text-to-video methods often rely on a single long prompt, limiting control over pose, composition, layout, and motion. We propose DrawVideo, a sketch-guided, storyboard-driven framework for controllable long-video generation. DrawVideo decomposes long videos into independently controllable shots, each defined by a black-and-white sketch, an appearance prompt, and a motion prompt. The sketch controls pose and layout, the appearance prompt defines identity, scene, and style, and the motion prompt guides temporal dynamics. DrawVideo follows a hierarchical 'global multi-shot, local single-sketch' strategy: it first generates a structure-aligned reference keyframe, then expands the motion prompt into derivative keyframes representing action states, and finally synthesizes clips between adjacent keyframes to build each shot. We also introduce SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments show that DrawVideo achieves strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DrawVideo, a sketch-guided storyboard-driven framework for controllable long-video generation. It decomposes long videos into shots defined by black-and-white sketches (for pose/layout), appearance prompts (for identity/scene/style), and motion prompts (for dynamics). The method uses a hierarchical 'global multi-shot, local single-sketch' strategy: generate a structure-aligned reference keyframe, expand the motion prompt into derivative keyframes, then synthesize and interpolate clips between keyframes to form each shot. It also introduces the SketchLongVideo dataset constructed from animation videos via shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. Experiments are reported to demonstrate strong structural controllability, appearance consistency, visual stability, and coherent long-video generation.
Significance. If the performance claims hold with supporting quantitative evidence, the work would offer a practical user-controllable approach to long video synthesis that improves on single-prompt text-to-video methods by incorporating explicit sketch and prompt controls per shot. The introduction of SketchLongVideo as the first dataset for this task is a clear positive contribution that could enable future research. The hierarchical decomposition strategy is a reasonable engineering response to the challenges of extended temporal coherence.
major comments (2)
- [Abstract] Abstract (central claim paragraph): the assertion of 'coherent long-video generation' rests on the unelaborated assumption that independently generated shots (via local single-sketch pipelines) will maintain global character identity, layout consistency, and narrative progression when concatenated. No cross-shot attention, global consistency module, or post-alignment step is described, making this the load-bearing weakest link for the overall coherence claim.
- [Abstract] Abstract (experiments sentence): the reported 'strong structural controllability, appearance consistency, visual stability, and coherent long-video generation' are stated without any accompanying quantitative metrics, ablation results, or error analysis, preventing verification that the hierarchical strategy actually delivers the claimed performance.
minor comments (2)
- The abstract provides no implementation details, model architecture diagrams, or training procedure, which limits assessment of reproducibility.
- Dataset construction steps (shot detection, keyframe extraction, etc.) are listed but without quantitative statistics on the resulting SketchLongVideo corpus (e.g., number of shots, average length, diversity metrics).
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the abstract to ensure claims are precisely supported by the method and experiments described in the full manuscript.
- Referee: [Abstract] Abstract (central claim paragraph): the assertion of 'coherent long-video generation' rests on the unelaborated assumption that independently generated shots (via local single-sketch pipelines) will maintain global character identity, layout consistency, and narrative progression when concatenated. No cross-shot attention, global consistency module, or post-alignment step is described, making this the load-bearing weakest link for the overall coherence claim.
  Authors: We agree that the abstract does not describe an automatic cross-shot mechanism. The method generates shots independently using per-shot sketches, appearance prompts, and motion prompts; global character identity is maintained by the user providing consistent appearance prompts for recurring elements across the storyboard, while narrative progression follows the user-defined shot sequence. No cross-shot attention or post-alignment is implemented. We will revise the abstract to clarify that coherence relies on consistent user-specified controls rather than an implicit global module, and to avoid overstating automatic long-range consistency.
  revision: yes
- Referee: [Abstract] Abstract (experiments sentence): the reported 'strong structural controllability, appearance consistency, visual stability, and coherent long-video generation' are stated without any accompanying quantitative metrics, ablation results, or error analysis, preventing verification that the hierarchical strategy actually delivers the claimed performance.
  Authors: Abstracts are concise summaries; the full manuscript's Experiments section reports quantitative results (keypoint alignment for structural controllability, CLIP-based similarity for appearance consistency, temporal difference metrics for stability) along with ablations on the hierarchical components and error analysis via failure cases. We will revise the abstract to qualify the claims as 'demonstrated in experiments' and ensure wording aligns directly with the quantitative evidence presented in the paper body.
  revision: yes
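For readers unfamiliar with the stability metric named in the rebuttal, the sketch below shows one common form of a temporal difference measure: the mean absolute pixel change between consecutive frames, averaged over the clip. The exact definition used in the paper is not given on this page, so this formulation, the grayscale nested-list frame format, and the function names are assumptions.

```python
from typing import Sequence

# A frame is a grid of grayscale intensities in [0, 1] (assumed format).
Frame = Sequence[Sequence[float]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def temporal_difference(frames: Sequence[Frame]) -> float:
    """Average frame-to-frame change; lower values suggest less flicker."""
    diffs = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A score near zero can also indicate a frozen video, so a measure like this is usually read alongside a motion or dynamics metric rather than on its own.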
Circularity Check
No circularity detected; derivation is compositional and self-contained
The provided abstract and strategy description present DrawVideo as a decomposition into independent per-shot generation using sketches, appearance prompts, and motion prompts, followed by hierarchical keyframe expansion and interpolation. No equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled via prior work appear. The central claim of coherent long-video generation is supported by the described pipeline and introduced dataset rather than reducing to its own inputs by construction. This matches the default expectation of a non-circular paper.
Appendix A Dataset Construction and Usage Statement
This section provides additional details about the construction, organization, processing protocol, a...
1. Load the reference keyframe $I_k^0$.
2. Resize and preprocess the input image.
3. Encode the reference image into latent space using the FLUX VAE encoder.
4. Encode the conversion prompt using the FLUX text encoders.
5. Inject the reference latent into the conditioning pathway through the ReferenceLatent module.
6. Perform reference-conditioned denoising with FLUX.1 Kontext.
7. Decode the generated latent representation into RGB space.
FLUX Kontext Backbone. We use FLUX.1 Kontext as the image generation backbone. In our implementation, the diffusion backbone is loaded from flux1-dev-kontext_fp8_scaled.safetensors. Text conditioning is performed using dual text encoders: clip_l.safetensors and t5xxl_fp16.safetensors. The latent auto...
1. Load the starting derivative keyframe $I_k^{i-1}$.
2. Load the ending derivative keyframe $I_k^{i}$.
3. Encode the structured dynamic prompt using the Wan text encoder.
4. Construct the first-last-frame latent initialization using the Wan first-last-frame conditioning module.
5. Perform hierarchical latent video diffusion using a high-noise stage followed by a low-noise refinement stage.
6. Decode the generated latent video representation into RGB frames using the Wan VAE decoder.
7. Concatenate local video clips into a complete shot video.
Wan 2.2 Backbone. We use Wan 2.2 as the latent video diffusion backbone. In our implementation, two diffusion models are used sequentially:
- wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors
- wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors
Text conditioning is encoded using: umt5_xxl_fp8_e4m3fn_...
1. high-noise motion generation stage,
2. low-noise refinement stage.
The high-noise stage primarily synthesizes:
- large-scale motion transitions
- temporal dynamics
- coarse motion structure
The low-noise stage further refines:
- character appearance
- temporal consistency
- local visual details
- motion smoothness
This hierarchical generation strategy improves temporal coherence compare...
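The sequential use of a high-noise model followed by a low-noise model can be pictured as a single denoising loop that switches models at a boundary step. The sketch below is a generic illustration under that assumption; the function name, the step counts, and the `boundary` parameter are hypothetical and do not reflect the actual Wan 2.2 sampler or the paper's settings.

```python
from typing import Callable, List

Latent = List[float]
# A denoiser takes the current latent and the step index and returns a new latent.
Denoiser = Callable[[Latent, int], Latent]


def two_stage_denoise(
    latent: Latent,
    high_noise_model: Denoiser,
    low_noise_model: Denoiser,
    num_steps: int = 20,  # illustrative value
    boundary: int = 10,   # hypothetical switch point between the two stages
) -> Latent:
    """Run the high-noise model for early steps, then the low-noise model."""
    if not 0 <= boundary <= num_steps:
        raise ValueError("boundary must lie between 0 and num_steps")
    for step in range(num_steps):
        model = high_noise_model if step < boundary else low_noise_model
        latent = model(latent, step)
    return latent
```

Under this picture, early steps fix coarse motion while the later steps only refine detail, which matches the division of labor described in the excerpt.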