
arxiv: 2605.17019 · v1 · submitted 2026-05-16 · 💻 cs.CV

StreamingEffect: Real-Time Human-Centric Video Effect Generation

Pith reviewed 2026-05-19 20:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-time video editing · human-centric effects · model distillation · causal autoregressive generation · streaming video · keyframe control · video effect dataset

The pith

StreamingEffect distills a bidirectional video editing teacher into a causal student that generates real-time human-centric effects at 720p on one GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets real-time generation of expressive video effects for live applications such as streaming and vlogging, where preserving the person's identity, the background, and smooth motion across frames is essential. It first trains a high-quality bidirectional teacher model in an in-context editing setup, then distills that teacher into a causal autoregressive student that processes the stream in one pass, without access to future frames. The authors also add keyframe control so that reference effect frames can be fed in during the stream and carried forward consistently. To overcome the lack of training data, they construct VideoEffect-130K, a large collection of effect and editing videos spanning 600 categories. If the distillation succeeds, high-quality streaming effect editing becomes feasible on a single GPU (the paper reports 720p on one H200) without needing future frames or bidirectional processing.

Core claim

A bidirectional teacher model trained for human-centric video effect editing can be distilled into a causal autoregressive student that operates in one pass with only four sampling steps while retaining comparable quality in identity preservation, background fidelity, and temporal consistency, and the addition of online keyframe injection makes the system interactive for live streams.
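One way to picture the gap between the bidirectional teacher and the causal student is the attention mask each would use. The sketch below is illustrative only: the chunk-wise causal mask, the function name `attention_mask`, and the chunk sizes are assumptions made for this example, not details confirmed by the paper.

```python
def attention_mask(num_chunks, frames_per_chunk, causal):
    """Return a boolean matrix where mask[q][k] is True if frame q may attend to frame k."""
    n = num_chunks * frames_per_chunk
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            if causal:
                # Assumed chunk-wise causality: a frame sees its own chunk
                # and all earlier chunks, never a later one.
                row.append(k // frames_per_chunk <= q // frames_per_chunk)
            else:
                # Bidirectional: every frame sees every other frame.
                row.append(True)
        mask.append(row)
    return mask


teacher_mask = attention_mask(3, 2, causal=False)
student_mask = attention_mask(3, 2, causal=True)

# Print the student's mask: "x" marks an allowed query/key pair.
for row in student_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption the student never needs frames that have not arrived yet, which is what makes streaming possible; the cost is the loss of global context that the referee report raises.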

What carries the argument

Causal autoregressive student distilled from a bidirectional in-context editing teacher, accelerated to four sampling steps and augmented with keyframe control that propagates reference effects through the video stream.
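The streaming loop described above can be sketched as a toy program. This is a minimal illustration under stated assumptions, not the authors' implementation: `StreamingEditor`, `toy_student`, the window size of two chunks, and the tuple-based "frames" are all hypothetical. Only the four-step count and the ideas of cached context and online keyframe injection come from the source.

```python
from collections import deque


class StreamingEditor:
    """Toy chunk-by-chunk editor; a sketch, not the paper's system."""

    def __init__(self, student, window_chunks=2, num_steps=4):
        self.student = student                    # hypothetical callable standing in for the network
        self.cache = deque(maxlen=window_chunks)  # sliding window of already-edited chunks
        self.keyframe = None                      # most recent reference effect, if any
        self.num_steps = num_steps                # 4 matches the step count reported in the source

    def inject_keyframe(self, reference):
        # Online control: takes effect from the next processed chunk onward.
        self.keyframe = reference

    def process_chunk(self, chunk):
        # A real system would start from noised latents; here each frame carries an empty effect slot.
        latent = [(frame_id, None) for frame_id in chunk]
        for step in range(self.num_steps):
            # The student sees only the current chunk and cached past chunks, never future frames.
            latent = self.student(latent, list(self.cache), self.keyframe, step)
        self.cache.append(latent)
        return latent


def toy_student(latent, context, keyframe, step):
    """Hypothetical stand-in: stamps every frame with the active reference effect."""
    return [(frame_id, keyframe) for frame_id, _ in latent]


editor = StreamingEditor(toy_student)
first = editor.process_chunk(["f0", "f1"])   # no effect requested yet
editor.inject_keyframe("sparkles")           # user supplies a reference mid-stream
second = editor.process_chunk(["f2", "f3"])
third = editor.process_chunk(["f4", "f5"])   # effect persists without re-injection
print(first, second, third, len(editor.cache))
```

The bounded `deque` keeps memory constant no matter how long the stream runs, which is the property a sliding-window cache is meant to provide.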

If this is right

  • 720p video effect editing runs in real time on a single H200 GPU.
  • Keyframe control lets users inject reference effect frames online that propagate consistently through the live stream.
  • The VideoEffect-130K dataset supplies training material across 600 effect categories for human-centric editing tasks.
  • Reducing diffusion steps from 50 to 4 yields the speed needed for streaming without retraining the core architecture.
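The arithmetic behind the last bullet is simple enough to check by hand. The snippet below is a back-of-the-envelope sketch: the 50 and 4 step counts come from the source, while the 24 fps target is a hypothetical value chosen for illustration, since the source does not state a frame rate. It also assumes denoising cost scales linearly with step count and ignores encoding and decoding overhead.

```python
TEACHER_STEPS = 50   # from the source
STUDENT_STEPS = 4    # from the source
ASSUMED_FPS = 24     # hypothetical target; not stated in the source

# Ratio of denoising steps, assuming cost is linear in the number of steps.
step_reduction = TEACHER_STEPS / STUDENT_STEPS

# Average time available per output frame to keep up with the stream.
frame_budget_ms = 1000.0 / ASSUMED_FPS

print(f"{step_reduction:.1f}x fewer denoising steps")
print(f"{frame_budget_ms:.1f} ms per frame at {ASSUMED_FPS} fps")
```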

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same teacher-to-causal-student pattern could be tested on other video-to-video tasks such as style transfer or object insertion to see if real-time performance generalizes.
  • Further quantization or pruning of the student might push the same pipeline onto consumer laptops or mobile devices while keeping acceptable quality.
  • Because the method works with one-pass autoregressive generation, it could reduce end-to-end latency in live video pipelines that currently rely on bidirectional processing.

Load-bearing premise

The distilled causal student model preserves human identity, background content, and temporal consistency at a level comparable to the bidirectional teacher when operating in a streaming, one-pass regime.

What would settle it

A direct frame-by-frame comparison on held-out test videos where the causal student version shows visibly worse identity drift or flickering artifacts than the bidirectional teacher under identical input conditions.

Figures

Figures reproduced from arXiv: 2605.17019 by Cheng Liu, Mike Zheng Shou, Yiren Song, Yuxin Jiang.

Figure 1: Given an incoming human-centric video stream, StreamingEffect supports interactive …
Figure 2: Overview of StreamingEffect. (a) Bidirectional Teacher Training: A bidirectional teacher is trained with reference-conditioned in-context video editing. (b) Causal Student Distillation: The teacher is distilled into a causal autoregressive student for streaming generation. (c) Sliding-window Autoregressive Inference: The student edits incoming video chunks online with cached context and propagates effects…
Figure 3: Construction and statistics of VideoEffect-130K. The dataset contains 130K paired human-centric videos, including 70K rendered effect samples and 60K general editing samples across about 600 categories. It is built with a hybrid synthetic-and-rendered pipeline, and each sample consists of a source video, a reference effect image, and a target edited video.
Figure 4: Qualitative results of StreamingEffect. (a) Image-guided: propagating reference-image effects while preserving identity. (b) Text-guided: following text prompts to generate diverse visual effects. Readers can click and play video clips using Adobe Acrobat.
Figure 5: Qualitative comparison against open-source (Kiwi-Edit, OmniWeaving) and commercial …
Figure 6: Qualitative ablation across distillation stages. Bidirectional: 50-step teacher with full attention. Causal AR 4/50 steps: Stage-1 causal student at 4 or 50 denoising steps. Self-Forcing: our final Stage-2 student (4 steps). Readers can click and play the video clips using Adobe Acrobat.
Original abstract

Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents StreamingEffect, a framework for real-time human-centric streaming video effect generation. It trains a bidirectional teacher on an in-context video editing architecture, distills it into a causal autoregressive student reduced to 4 sampling steps, introduces keyframe control for online reference injection, and releases the VideoEffect-130K dataset (70K effect videos + 60K editing videos across 600 categories). Experiments claim to show real-time high-quality 720p editing on a single H200 GPU while preserving identity, background content, and temporal consistency.

Significance. If the central claim holds, the work could advance deployable live video editing for applications such as streaming and vlogging. Notable strengths include the construction of a large-scale human-centric effect dataset and the extension of distillation techniques to a causal streaming editing setting, both of which could support reproducible follow-up research.

major comments (1)
  1. [Experiments] Experiments section: The central claim that the 4-step causal student preserves human identity, background fidelity, and temporal consistency at levels comparable to the bidirectional teacher in a strictly one-pass streaming regime lacks direct quantitative support. No side-by-side metrics (identity cosine similarity, background PSNR/SSIM, or temporal flicker scores) comparing teacher and student on identical streaming inputs are reported, leaving open the risk of compounding drift or reduced global context in the autoregressive student.
minor comments (2)
  1. [Method] The keyframe control mechanism is introduced but would benefit from an explicit diagram or pseudocode showing online propagation through the causal stream.
  2. [Dataset] Dataset curation details (e.g., exact filtering criteria for the 600 effect categories) could be expanded to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment on the Experiments section below and will incorporate additional quantitative analysis in the revised manuscript.

  1. Referee: [Experiments] Experiments section: The central claim that the 4-step causal student preserves human identity, background fidelity, and temporal consistency at levels comparable to the bidirectional teacher in a strictly one-pass streaming regime lacks direct quantitative support. No side-by-side metrics (identity cosine similarity, background PSNR/SSIM, or temporal flicker scores) comparing teacher and student on identical streaming inputs are reported, leaving open the risk of compounding drift or reduced global context in the autoregressive student.

    Authors: We agree that direct side-by-side quantitative metrics between the bidirectional teacher and the 4-step causal student on identical streaming inputs would provide stronger evidence for the preservation of identity, background fidelity, and temporal consistency. The current manuscript reports the student's performance primarily through comparisons against external baselines, user studies, and qualitative results, without explicit numerical teacher-student comparisons in the one-pass streaming setting. In the revision, we will add a dedicated table (or subsection) reporting identity cosine similarity, background PSNR/SSIM, and temporal flicker scores computed on the same held-out streaming sequences for both models. This addition will directly quantify any potential drift or context limitations.

    Revision: yes
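The three metric families named in this exchange can be sketched in a few lines. The code below is a minimal illustration, not the evaluation protocol of the paper: the function names are hypothetical, frames are flat lists of pixel values, identity embeddings are assumed to come from some external face encoder that is not shown, background masking is omitted, and mean absolute frame difference is only one simple proxy for flicker.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two identity embeddings (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames given as flat pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def flicker(frames):
    """Mean absolute per-pixel change between consecutive frames (lower is steadier)."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(p - c) for p, c in zip(prev, cur)) / len(cur))
    return sum(diffs) / len(diffs)


# Tiny synthetic inputs, purely for illustration.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(psnr([0, 0], [255, 255]))
print(flicker([[0, 0], [2, 2], [2, 2]]))
```

A teacher-versus-student table of the kind the authors promise would apply functions like these to both models' outputs on the same held-out streams and report the gap.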

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent dataset construction and experimental validation


The paper introduces a new human-centric video effect dataset (VideoEffect-130K) and applies standard knowledge distillation from a bidirectional teacher to a causal student model, followed by step reduction and keyframe control. The central performance claims (real-time 720p editing on H200 GPU) are presented as outcomes of empirical experiments rather than quantities defined by fitted parameters or self-referential equations. No load-bearing steps reduce by construction to inputs; the approach builds on existing distillation techniques but evaluates them in a new streaming setting with externally constructed data and reported metrics. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central performance claim rests on the assumption that standard video diffusion architectures can be distilled while retaining identity and consistency, plus the representativeness of the curated 130K dataset; no explicit free parameters or invented physical entities are described in the abstract.

free parameters (1)
  • sampling steps
    Reduced from 50 to 4 steps to achieve real-time speed; this is a design choice that directly affects the reported latency-quality tradeoff.
invented entities (1)
  • Keyframe control mechanism (no independent evidence)
    purpose: Allows reference effect frames to be injected online and propagated through the live stream for interactive editing.
    New control interface introduced to enable interactive use in streaming settings.




Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 25 internal anchors

  1. [1]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  4. [4]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024

  6. [6]

    Pix2video: Video editing using image diffusion

    Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 23206–23217, 2023

  7. [7]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392

  8. [8]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

  9. [9]

    Transanimate: Taming layer diffusion to generate rgba video

    Xuewei Chen, Zhimin Chen, and Yiren Song. Transanimate: Taming layer diffusion to generate rgba video. arXiv preprint arXiv:2503.17934, 2025

  10. [10]

    Consistent video-to-video transfer using synthetic dataset

    Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset. arXiv preprint arXiv:2311.00213, 2023

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  12. [12]

    Dollar: Few-step video generation via distillation and latent reward optimization

    Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17961–17971, 2025

  13. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206

  14. [14]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023

  15. [15]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023

  16. [16]

    Relationadapter: Learning and transferring visual relation with diffusion transformers

    Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin Zhang. Relationadapter: Learning and transferring visual relation with diffusion transformers. arXiv preprint arXiv:2506.02528, 2025

  17. [17]

    Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks

    Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, and Jiaming Liu. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks. arXiv preprint arXiv:2501.15891, 2025

  18. [18]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  19. [19]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022

  20. [20]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  22. [22]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  23. [23]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022

  24. [24]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  25. [25]

    Vace: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

  26. [26]

    Imagic: Text-based real image editing with diffusion models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023

  27. [27]

    Kling-Omni Technical Report

    Kling Team. Kling-Omni Technical Report, 2025. URL https://arxiv.org/abs/2512.16776

  28. [28]

    Streamdit: Real-time streaming text-to-video generation

    Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation. arXiv preprint arXiv:2507.03745, 2025

  29. [29]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  30. [30]

    Anyv2v: A tuning-free framework for any video-to-video editing tasks

    Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024

  31. [31]

    Vfxmaster: Unlocking dynamic visual effect generation via in-context learning

    Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, et al. Vfxmaster: Unlocking dynamic visual effect generation via in-context learning. arXiv preprint arXiv:2510.25772, 2025

  32. [32]

    Ic-effect: Precise and efficient video effects editing via in-context learning

    Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, and Qi Mao. Ic-effect: Precise and efficient video effects editing via in-context learning. arXiv preprint arXiv:2512.15635, 2025

  33. [33]

    Looking backward: Streaming video-to-video translation with feature banks

    Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Looking backward: Streaming video-to-video translation with feature banks. arXiv preprint arXiv:2405.15757, 2024

  34. [34]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025

  35. [35]

    Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

    Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance. arXiv preprint arXiv:2603.02175, 2026

  36. [36]

    Streaming autoregressive video generation via diagonal distillation

    Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation. arXiv preprint arXiv:2603.09488, 2026

  37. [37]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  38. [38]

    Video-p2p: Video editing with cross-attention control

    Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024

  39. [39]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  40. [40]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In International Conference on Learning Representations, 2024

  41. [41]

    Vfx creator: Animated visual effect generation with controllable diffusion transformer

    Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, and Yike Guo. Vfx creator: Animated visual effect generation with controllable diffusion transformer. arXiv preprint arXiv:2502.05979, 2025

  42. [42]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  43. [43]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

  44. [44]

    Follow your pose: Pose-guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

  45. [45]

    Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

    Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024

  46. [46]

    Follow-your-creation: Empowering 4d creation through video inpainting

    Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590, 2025

  47. [47]

    Follow-your-click: Open-domain regional image animation via motion prompts

    Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6018–6026, 2025

  48. [48]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  49. [49]

    Codef: Content deformation fields for temporally consistent video processing

    Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8089–8099, 2024

  50. [50]

    Omniweaving: Towards unified video generation with free-form composition and reasoning

    Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, et al. Omniweaving: Towards unified video generation with free-form composition and reasoning. arXiv preprint arXiv:2603.24458, 2026

  51. [51]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  52. [52]

    Fatezero: Fusing attentions for zero-shot text-based video editing

    Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

  53. [53]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  54. [54]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  55. [55]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

  56. [56]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  57. [57]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  58. [58]

    Worldwander: Bridging egocentric and exocentric worlds in video generation

    Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, and Mike Zheng Shou. Worldwander: Bridging egocentric and exocentric worlds in video generation. arXiv preprint arXiv:2511.22098, 2025

  59. [59]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

  60. [60]

    Processpainter: Learn painting process from sequence data

    Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Processpainter: Learn painting process from sequence data. arXiv preprint arXiv:2406.06062, 2024

  61. [61]

    Mitty: Diffusion-based human-to-robot video generation

    Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025

  62. [62]

    Omniconsistency: Learning style-agnostic consistency from paired stylization data

    Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. arXiv preprint arXiv:2505.18445, 2025

  63. [63]

    OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation. arXiv preprint arXiv:2605.12038, 2026

  64. [64]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  65. [65]

    Diffdecompose: Layer-wise decomposition of alpha-composited images via diffusion transformers

    Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, and Yiren Song. Diffdecompose: Layer-wise decomposition of alpha-composited images via diffusion transformers. arXiv preprint arXiv:2505.21541, 2025

  66. [66]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

  67. [67]

    InsViE-1M: Effective instruction-based video editing with elaborate dataset construction

    Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. InsViE-1M: Effective instruction-based video editing with elaborate dataset construction. arXiv preprint arXiv:2503.20287, 2025

  68. [68]

    X-humanoid: Robotize human videos to generate humanoid videos at scale

    Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale. arXiv preprint arXiv:2512.04537, 2025

  69. [69]

    Rerender a video: Zero-shot text-guided video-to-video translation

    Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023

  70. [70]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  71. [71]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

  72. [72]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  73. [73]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

  74. [74]

    VEGGIE: instructional editing and reasoning video concepts with grounded generation

    Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: instructional editing and reasoning video concepts with grounded generation. arXiv:2503.14350, 2025

  75. [75]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  76. [76]

    Ssr-encoder: Encoding selective subject representation for subject-driven generation

    Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024

  77. [77]

    Easycontrol: Adding efficient and flexible control for diffusion transformer

    Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025

Appendix A Limitations

A primary limitation of our work stems from the composition of VideoEffect-130K. Both th...