StreamingEffect: Real-Time Human-Centric Video Effect Generation

arxiv: 2605.17019 · v1 · submitted 2026-05-16 · 💻 cs.CV


Yiren Song, Cheng Liu, Yuxin Jiang, Mike Zheng Shou

Pith reviewed 2026-05-19 20:08 UTC · model grok-4.3


keywords real-time video editing · human-centric effects · model distillation · causal autoregressive generation · streaming video · keyframe control · video effect dataset


The pith

StreamingEffect distills a bidirectional video editing teacher into a causal student that generates real-time human-centric effects at 720p on one GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of generating expressive video effects in real time for live applications like streaming and vlogging, where preserving the person's identity, the background, and smooth motion across frames is essential. It does this by first training a high-quality bidirectional teacher model on an in-context editing setup, then distilling that teacher down to a causal autoregressive student that works in a single forward pass. The authors also add keyframe control so that reference effect frames can be fed in during the stream and carried forward consistently. To overcome the lack of training data they release VideoEffect-130K, a large collection of effect and editing videos spanning 600 categories. If the distillation succeeds, high-quality streaming effect editing becomes feasible on modest hardware without needing future frames or heavy compute.

Core claim

A bidirectional teacher model trained for human-centric video effect editing can be distilled into a causal autoregressive student that operates in one pass with only four sampling steps while retaining comparable quality in identity preservation, background fidelity, and temporal consistency, and the addition of online keyframe injection makes the system interactive for live streams.

What carries the argument

Causal autoregressive student distilled from a bidirectional in-context editing teacher, accelerated to four sampling steps and augmented with keyframe control that propagates reference effects through the video stream.
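The streaming loop this argument depends on can be sketched as follows. This is a minimal illustration, not the authors' implementation: `edit_chunk` is a hypothetical placeholder for the four-step causal student, strings stand in for frames, and the window size, the choice to cache edited chunks, and the names `StreamState` and `stream_edit` are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Iterable, Iterator, List, Optional

Chunk = List[str]  # stand-in for a short run of video frames


@dataclass
class StreamState:
    max_cached_chunks: int = 2                     # hypothetical window size
    cache: Deque[Chunk] = field(default_factory=deque)
    keyframe: Optional[str] = None                 # most recent injected reference


def edit_chunk(chunk: Chunk, context: List[Chunk], keyframe: Optional[str]) -> Chunk:
    """Hypothetical placeholder for the student model: tags frames instead of denoising."""
    tag = keyframe or "none"
    return [f"{frame}|fx={tag}|ctx={len(context)}" for frame in chunk]


def stream_edit(
    chunks: Iterable[Chunk],
    keyframes_by_index: Dict[int, str],
    state: Optional[StreamState] = None,
) -> Iterator[Chunk]:
    """Edit chunks in arrival order, using only past context and the latest keyframe."""
    state = state or StreamState()
    for i, chunk in enumerate(chunks):
        # A keyframe injected at chunk i takes effect from chunk i onward.
        if i in keyframes_by_index:
            state.keyframe = keyframes_by_index[i]
        edited = edit_chunk(chunk, list(state.cache), state.keyframe)
        # Assumption: edited chunks are cached; the oldest is evicted when the window is full.
        state.cache.append(edited)
        while len(state.cache) > state.max_cached_chunks:
            state.cache.popleft()
        yield edited
```

Each chunk sees only earlier chunks, so nothing in the loop waits on future frames, and the bounded cache keeps per-chunk cost from growing with stream length.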

If this is right

720p video effect editing runs in real time on a single H200 GPU.
Keyframe control lets users inject reference effect frames online that propagate consistently through the live stream.
The VideoEffect-130K dataset supplies training material across 600 effect categories for human-centric editing tasks.
Reducing diffusion sampling from 50 steps to 4 through distillation yields the speed needed for streaming.
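The arithmetic behind the step reduction is simple enough to write down. The sketch below counts denoiser evaluations and derives a per-chunk time budget; the step counts 50 and 4 come from the abstract, while the chunk length, the frame rate, and the function names are hypothetical values chosen only to make the calculation concrete.

```python
def denoiser_calls(num_chunks: int, steps_per_chunk: int) -> int:
    """Total network evaluations needed to edit a stream of chunks."""
    return num_chunks * steps_per_chunk


def real_time_budget_ms(frames_per_chunk: int, fps: float) -> float:
    """Wall-clock time available per chunk if output must keep pace with input."""
    return 1000.0 * frames_per_chunk / fps


TEACHER_STEPS = 50   # from the abstract
STUDENT_STEPS = 4    # from the abstract

# Ratio of evaluations for the same number of chunks.
reduction = denoiser_calls(10, TEACHER_STEPS) / denoiser_calls(10, STUDENT_STEPS)

# Hypothetical stream: 12-frame chunks at 24 fps.
budget = real_time_budget_ms(frames_per_chunk=12, fps=24.0)
per_step_budget = budget / STUDENT_STEPS

print(f"{reduction}x fewer evaluations, {budget} ms per chunk, {per_step_budget} ms per step")
```

The 12.5x ratio bounds the speedup from sampling alone. Encoding, decoding, and attention over cached context also consume part of the per-chunk budget, so measured throughput would need to be checked against the paper's own numbers.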

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same teacher-to-causal-student pattern could be tested on other video-to-video tasks such as style transfer or object insertion to see if real-time performance generalizes.
Further quantization or pruning of the student might push the same pipeline onto consumer laptops or mobile devices while keeping acceptable quality.
Because the method works with one-pass autoregressive generation, it could reduce end-to-end latency in live video pipelines that currently rely on bidirectional processing.

Load-bearing premise

The distilled causal student model preserves human identity, background content, and temporal consistency at a level comparable to the bidirectional teacher when operating in a streaming, one-pass regime.

What would settle it

A direct frame-by-frame comparison on held-out test videos where the causal student version shows visibly worse identity drift or flickering artifacts than the bidirectional teacher under identical input conditions.

Figures

Figures reproduced from arXiv: 2605.17019 by Cheng Liu, Mike Zheng Shou, Yiren Song, Yuxin Jiang.

**Figure 2.** Overview of StreamingEffect. (a) Bidirectional Teacher Training: A bidirectional teacher is trained with reference-conditioned in-context video editing. (b) Causal Student Distillation: The teacher is distilled into a causal autoregressive student for streaming generation. (c) Sliding-window Autoregressive Inference: The student edits incoming video chunks online with cached context and propagates effects…

**Figure 3.** Construction and statistics of VideoEffect-130K. The dataset contains 130K paired human-centric videos, including 70K rendered effect samples and 60K general editing samples across about 600 categories. It is built with a hybrid synthetic-and-rendered pipeline, and each sample consists of a source video, a reference effect image, and a target edited video.

**Figure 4.** Qualitative results of StreamingEffect. (a) Image-guided: propagating reference-image effects while preserving identity. (b) Text-guided: following text prompts to generate diverse visual effects. Readers can click and play video clips using Adobe Acrobat.

**Figure 5.** Qualitative comparison against open-source (Kiwi-Edit, OmniWeaving) and commercial…

**Figure 6.** Qualitative ablation across distillation stages. Bidirectional: 50-step teacher with full attention. Causal AR 4/50 steps: Stage-1 causal student at 4 or 50 denoising steps. Self-Forcing: our final Stage-2 student (4 steps). Readers can click and play the video clips using Adobe Acrobat.
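The difference between the teacher's full attention and a causal student's attention can be illustrated with boolean masks. The sketch below assumes a block-causal layout, in which a frame attends to its own chunk and to all earlier chunks; the material summarized here does not specify the student's exact mask, and `full_mask`, `block_causal_mask`, and `chunk_size` are illustrative names rather than anything taken from the paper.

```python
from typing import List

Mask = List[List[bool]]  # mask[q][k] is True when query frame q may attend to key frame k


def full_mask(num_frames: int) -> Mask:
    """Bidirectional attention: every frame sees every other frame, including future ones."""
    return [[True] * num_frames for _ in range(num_frames)]


def block_causal_mask(num_frames: int, chunk_size: int) -> Mask:
    """Assumed causal layout: frames see their own chunk and earlier chunks only."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        [k // chunk_size <= q // chunk_size for k in range(num_frames)]
        for q in range(num_frames)
    ]


if __name__ == "__main__":
    for row in block_causal_mask(num_frames=4, chunk_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With four frames in chunks of two, the first chunk cannot see the second, while the second sees both. That missing access to future frames is the reduced global context the referee report raises as a risk for the student.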

read the original abstract

Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts distillation to a causal student for streaming video effects and ships a large new dataset, but the evidence that quality matches the teacher under one-pass constraints is still thin.


The main point for you is that StreamingEffect takes existing bidirectional video editing models and distills them into a fast causal autoregressive version that supports real-time streaming with online keyframe control. They also release VideoEffect-130K, which looks like the biggest human-centric effect dataset so far, drawn from short-video platforms and covering 600 categories with both effect and editing videos.

Referee Report

1 major / 2 minor

Summary. The manuscript presents StreamingEffect, a framework for real-time human-centric streaming video effect generation. It trains a bidirectional teacher on an in-context video editing architecture, distills it into a causal autoregressive student reduced to 4 sampling steps, introduces keyframe control for online reference injection, and releases the VideoEffect-130K dataset (70K effect videos + 60K editing videos across 600 categories). Experiments claim to show real-time high-quality 720p editing on a single H200 GPU while preserving identity, background content, and temporal consistency.

Significance. If the central claim holds, the work could advance deployable live video editing for applications such as streaming and vlogging. Notable strengths include the construction of a large-scale human-centric effect dataset and the extension of distillation techniques to a causal streaming editing setting, both of which could support reproducible follow-up research.

major comments (1)

[Experiments] Experiments section: The central claim that the 4-step causal student preserves human identity, background fidelity, and temporal consistency at levels comparable to the bidirectional teacher in a strictly one-pass streaming regime lacks direct quantitative support. No side-by-side metrics (identity cosine similarity, background PSNR/SSIM, or temporal flicker scores) comparing teacher and student on identical streaming inputs are reported, leaving open the risk of compounding drift or reduced global context in the autoregressive student.
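To make the requested comparison concrete, simplified versions of the three kinds of metric can be sketched. This is a dependency-free illustration under stated assumptions, not the evaluation protocol of the paper: flat lists of floats stand in for face embeddings and for pixel intensities in [0, 1], the flicker score is a naive consecutive-frame difference with no motion compensation, and the function names are hypothetical. A real evaluation would take embeddings from a face-recognition model and restrict PSNR to a background mask.

```python
import math
from typing import List, Sequence

Frame = List[float]  # flattened pixel intensities in [0, 1]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Identity proxy: similarity between source and edited face embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def psnr(reference: Frame, test: Frame, peak: float = 1.0) -> float:
    """Fidelity proxy: peak signal-to-noise ratio in dB (infinite for identical frames)."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def flicker(frames: Sequence[Frame]) -> float:
    """Naive temporal score: mean absolute change between consecutive frames."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Computing each score for teacher and student outputs on the same held-out inputs, and tracking how the gap changes with position in the stream, would show whether drift compounds over time.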

minor comments (2)

[Method] The keyframe control mechanism is introduced but would benefit from an explicit diagram or pseudocode showing online propagation through the causal stream.
[Dataset] Dataset curation details (e.g., exact filtering criteria for the 600 effect categories) could be expanded to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment on the Experiments section below and will incorporate additional quantitative analysis in the revised manuscript.


Referee: [Experiments] Experiments section: The central claim that the 4-step causal student preserves human identity, background fidelity, and temporal consistency at levels comparable to the bidirectional teacher in a strictly one-pass streaming regime lacks direct quantitative support. No side-by-side metrics (identity cosine similarity, background PSNR/SSIM, or temporal flicker scores) comparing teacher and student on identical streaming inputs are reported, leaving open the risk of compounding drift or reduced global context in the autoregressive student.

Authors: We agree that direct side-by-side quantitative metrics between the bidirectional teacher and the 4-step causal student on identical streaming inputs would provide stronger evidence for the preservation of identity, background fidelity, and temporal consistency. The current manuscript reports the student's performance primarily through comparisons against external baselines, user studies, and qualitative results, without explicit numerical teacher-student comparisons in the one-pass streaming setting. In the revision, we will add a dedicated table (or subsection) reporting identity cosine similarity, background PSNR/SSIM, and temporal flicker scores computed on the same held-out streaming sequences for both models. This addition will directly quantify any potential drift or context limitations.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent dataset construction and experimental validation


The paper introduces a new human-centric video effect dataset (VideoEffect-130K) and applies standard knowledge distillation from a bidirectional teacher to a causal student model, followed by step reduction and keyframe control. The central performance claims (real-time 720p editing on H200 GPU) are presented as outcomes of empirical experiments rather than quantities defined by fitted parameters or self-referential equations. No load-bearing steps reduce by construction to inputs; the approach builds on existing distillation techniques but evaluates them in a new streaming setting with externally constructed data and reported metrics. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central performance claim rests on the assumption that standard video diffusion architectures can be distilled while retaining identity and consistency, plus the representativeness of the curated 130K dataset; no explicit free parameters or invented physical entities are described in the abstract.

free parameters (1)

sampling steps
Reduced from 50 to 4 steps to achieve real-time speed; this is a design choice that directly affects the reported latency-quality tradeoff.

invented entities (1)

Keyframe control mechanism (no independent evidence)
purpose: Allows reference effect frames to be injected online and propagated through the live stream for interactive editing.
New control interface introduced to enable interactive use in streaming settings.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 25 internal anchors

[1]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

work page 2024
[2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

work page 2023
[4]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

work page 2023
[5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024

work page 2024
[6]

Pix2video: Video editing using image diffusion

Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 23206–23217, 2023

work page 2023
[7]

Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv. org/abs/2407.01392

work page arXiv 2024
[8]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

work page 2024
[9]

Transanimate: Taming layer diffusion to generate rgba video

Xuewei Chen, Zhimin Chen, and Yiren Song. Transanimate: Taming layer diffusion to generate rgba video. arXiv preprint arXiv:2503.17934, 2025

work page arXiv 2025
[10]

Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

work page arXiv 2023
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Dollar: Few-step video generation via distillation and latent reward optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17961–17971, 2025

work page 2025
[13]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Emu video: Factorizing text-to-video generation by explicit image conditioning.arXiv preprint arXiv:2311.10709, 2023

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning.arXiv preprint arXiv:2311.10709, 2023

work page arXiv 2023
[16]

Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin Zhang. Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

work page arXiv 2025
[17]

Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks.arXiv preprint arXiv:2501.15891, 2025

Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, and Jiaming Liu. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks.arXiv preprint arXiv:2501.15891, 2025

work page arXiv 2025
[18]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/ abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[22]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.arXiv:2204.03458, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

work page 2025
[26]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023

work page 2023
[27]

Kling-Omni Technical Report

Kling Team. Kling-Omni Technical Report, 2025. URLhttps://arxiv.org/abs/2512.16776

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025

Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025

work page arXiv 2025
[29]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

work page arXiv 2024
[31]

Vfxmaster: Unlocking dynamic visual effect generation via in-context learning.arXiv preprint arXiv:2510.25772, 2025

Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, et al. Vfxmaster: Unlocking dynamic visual effect generation via in-context learning.arXiv preprint arXiv:2510.25772, 2025

work page arXiv 2025
[32]

Ic-effect: Precise and efficient video effects editing via in-context learning.arXiv preprint arXiv:2512.15635, 2025

Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, and Qi Mao. Ic-effect: Precise and efficient video effects editing via in-context learning.arXiv preprint arXiv:2512.15635, 2025

work page arXiv 2025
[33]

Look- ing backward: Streaming video-to-video translation with feature banks.arXiv preprint arXiv:2405.15757, 2024

Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Look- ing backward: Streaming video-to-video translation with feature banks.arXiv preprint arXiv:2405.15757, 2024

work page arXiv 2024
[34]

Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

work page arXiv 2025
[35]

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Streaming autoregressive video generation via diagonal distillation.arXiv preprint arXiv:2603.09488, 2026

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation.arXiv preprint arXiv:2603.09488, 2026

work page arXiv 2026
[37]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Video-p2p: Video editing with cross-attention control

Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024

work page 2024
[39]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

Instaflow: One step is enough for high- quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high- quality diffusion-based text-to-image generation. InInternational Conference on Learning Representations, 2024

work page 2024
[41]

Vfx creator: Animated visual effect generation with controllable diffusion transformer.arXiv preprint arXiv:2502.05979, 2025

Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, and Yike Guo. Vfx creator: Animated visual effect generation with controllable diffusion transformer.arXiv preprint arXiv:2502.05979, 2025

work page arXiv 2025
[42]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Follow your pose: Pose-guided text-to-video generation using pose-free videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

work page 2024
[45]

Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024

work page 2024
[46]

arXiv preprint arXiv:2506.04590 , year =

Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

work page arXiv 2025
[47]

Follow-your-click: Open-domain regional image animation via motion prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6018–6026, 2025

work page 2025
[48]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Codef: Content deformation fields for temporally consistent video processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8089–8099, 2024

work page 2024
[50]

Omniweaving: Towards unified video generation with free-form composition and reasoning.arXiv preprint arXiv:2603.24458, 2026

Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, et al. Omniweaving: Towards unified video generation with free-form composition and reasoning.arXiv preprint arXiv:2603.24458, 2026

work page arXiv 2026
[51]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[52]

Fatezero: Fusing attentions for zero-shot text-based video editing

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

[53]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[54]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

[55]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

[56]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

[57]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[58]

Worldwander: Bridging egocentric and exocentric worlds in video generation

Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, and Mike Zheng Shou. Worldwander: Bridging egocentric and exocentric worlds in video generation. arXiv preprint arXiv:2511.22098, 2025

[59]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

[60]

Processpainter: Learn painting process from sequence data

Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Processpainter: Learn painting process from sequence data. arXiv preprint arXiv:2406.06062, 2024

[61]

Mitty: Diffusion-based human-to-robot video generation

Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025

[62]

Omniconsistency: Learning style-agnostic consistency from paired stylization data

Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. arXiv preprint arXiv:2505.18445, 2025

[63]

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation. arXiv preprint arXiv:2605.12038, 2026

[64]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[65]

Diffdecompose: Layer-wise decomposition of alpha-composited images via diffusion transformers

Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, and Yiren Song. Diffdecompose: Layer-wise decomposition of alpha-composited images via diffusion transformers. arXiv preprint arXiv:2505.21541, 2025

[66]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

[67]

InsViE-1M: Effective instruction-based video editing with elaborate dataset construction

Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. InsViE-1M: Effective instruction-based video editing with elaborate dataset construction. arXiv preprint arXiv:2503.20287, 2025

[68]

X-humanoid: Robotize human videos to generate humanoid videos at scale

Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale. arXiv preprint arXiv:2512.04537, 2025

[69]

Rerender a video: Zero-shot text-guided video-to-video translation

Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023

[70]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[71]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

[72]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

[73]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025

[74]

VEGGIE: instructional editing and reasoning video concepts with grounded generation

Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: instructional editing and reasoning video concepts with grounded generation. arXiv:2503.14350, 2025

[75]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

[76]

Ssr-encoder: Encoding selective subject representation for subject-driven generation

Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024

[77]

Easycontrol: Adding efficient and flexible control for diffusion transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025.

Appendix A Limitations

A primary limitation of our work stems from the composition of VideoEffect-130K. Both th...


[1] [1]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

work page 2024

[2] [2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

work page 2023

[4] [4]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

work page 2023

[5] [5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024

work page 2024

[6] [6]

Pix2video: Video editing using image diffusion

Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 23206–23217, 2023

work page 2023

[7] [7]

Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv. org/abs/2407.01392

work page arXiv 2024

[8] [8]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

work page 2024

[9] [9]

Transanimate: Taming layer diffusion to generate rgba video

Xuewei Chen, Zhimin Chen, and Yiren Song. Transanimate: Taming layer diffusion to generate rgba video. arXiv preprint arXiv:2503.17934, 2025

work page arXiv 2025

[10] [10]

Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

Jiaxin Cheng, Tianjun Xiao, and Tong He. Consistent video-to-video transfer using synthetic dataset.arXiv preprint arXiv:2311.00213, 2023

work page arXiv 2023

[11] [11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Dollar: Few-step video generation via distillation and latent reward optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17961–17971, 2025

work page 2025

[13] [13]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Emu video: Factorizing text-to-video generation by explicit image conditioning.arXiv preprint arXiv:2311.10709, 2023

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning.arXiv preprint arXiv:2311.10709, 2023

work page arXiv 2023

[16] [16]

Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, and Yin Zhang. Relationadapter: Learning and transferring visual relation with diffusion transformers.arXiv preprint arXiv:2506.02528, 2025

work page arXiv 2025

[17] [17]

Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks.arXiv preprint arXiv:2501.15891, 2025

Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, and Jiaming Liu. Any2anytryon: Leveraging adaptive position embeddings for versatile virtual clothing tasks.arXiv preprint arXiv:2501.15891, 2025

work page arXiv 2025

[18] [18]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/ abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[22] [22]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.arXiv:2204.03458, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[24] [24]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191–17202, 2025

work page 2025

[26] [26]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023

work page 2023

[27] [27]

Kling-Omni Technical Report

Kling Team. Kling-Omni Technical Report, 2025. URLhttps://arxiv.org/abs/2512.16776

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025

Akio Kodaira, Tingbo Hou, Ji Hou, Markos Georgopoulos, Felix Juefei-Xu, Masayoshi Tomizuka, and Yue Zhao. Streamdit: Real-time streaming text-to-video generation.arXiv preprint arXiv:2507.03745, 2025

work page arXiv 2025

[29] [29]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

Max Ku, Cong Wei, Weiming Ren, Harry Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024

work page arXiv 2024

[31] [31]

Vfxmaster: Unlocking dynamic visual effect generation via in-context learning.arXiv preprint arXiv:2510.25772, 2025

Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, et al. Vfxmaster: Unlocking dynamic visual effect generation via in-context learning.arXiv preprint arXiv:2510.25772, 2025

work page arXiv 2025

[32] [32]

Ic-effect: Precise and efficient video effects editing via in-context learning.arXiv preprint arXiv:2512.15635, 2025

Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, and Qi Mao. Ic-effect: Precise and efficient video effects editing via in-context learning.arXiv preprint arXiv:2512.15635, 2025

work page arXiv 2025

[33] [33]

Look- ing backward: Streaming video-to-video translation with feature banks.arXiv preprint arXiv:2405.15757, 2024

Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Look- ing backward: Streaming video-to-video translation with feature banks.arXiv preprint arXiv:2405.15757, 2024

work page arXiv 2024

[34] [34]

Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

work page arXiv 2025

[35] [35]

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, and Mike Zheng Shou. Kiwi-edit: Versatile video editing via instruction and reference guidance.arXiv preprint arXiv:2603.02175, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

Streaming autoregressive video generation via diagonal distillation.arXiv preprint arXiv:2603.09488, 2026

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation.arXiv preprint arXiv:2603.09488, 2026

work page arXiv 2026

[37] [37]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Video-p2p: Video editing with cross-attention control

Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8599–8608, 2024

work page 2024

[39] [39]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[40] [40]

Instaflow: One step is enough for high- quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high- quality diffusion-based text-to-image generation. InInternational Conference on Learning Representations, 2024

work page 2024

[41] [41]

Vfx creator: Animated visual effect generation with controllable diffusion transformer.arXiv preprint arXiv:2502.05979, 2025

Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, and Yike Guo. Vfx creator: Animated visual effect generation with controllable diffusion transformer.arXiv preprint arXiv:2502.05979, 2025

work page arXiv 2025

[42] [42]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Follow your pose: Pose-guided text-to-video generation using pose-free videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

work page 2024

[45] [45]

Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024

work page 2024

[46] [46]

arXiv preprint arXiv:2506.04590 , year =

Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, and Qifeng Chen. Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

work page arXiv 2025

[47] [47]

Follow-your-click: Open-domain regional image animation via motion prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Leqi Shen, Chenyang Qi, Jixuan Ying, Chengfei Cai, Zhifeng Li, Heung-Yeung Shum, et al. Follow-your-click: Open-domain regional image animation via motion prompts. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6018–6026, 2025

work page 2025

[48] [48]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[49] [49]

Codef: Content deformation fields for temporally consistent video processing

Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, and Yujun Shen. Codef: Content deformation fields for temporally consistent video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8089–8099, 2024

work page 2024

[50] [50]

Omniweaving: Towards unified video generation with free-form composition and reasoning.arXiv preprint arXiv:2603.24458, 2026

Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, et al. Omniweaving: Towards unified video generation with free-form composition and reasoning.arXiv preprint arXiv:2603.24458, 2026

work page arXiv 2026

[51] [51]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023

[52] [52]

Fatezero: Fusing attentions for zero-shot text-based video editing

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

work page 2023

[53] [53]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[54] [54]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 12

work page internal anchor Pith review Pith/arXiv arXiv 2022

[55] [55]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

work page 2024

[56] [56]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[57] [57]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[58] [58]

Worldwander: Bridging egocentric and exocentric worlds in video generation.arXiv preprint arXiv:2511.22098, 2025

Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, and Mike Zheng Shou. Worldwander: Bridging egocentric and exocentric worlds in video generation.arXiv preprint arXiv:2511.22098, 2025

work page arXiv 2025

[59] [59]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [60]

Processpainter: Learn painting process from sequence data.arXiv preprint arXiv:2406.06062, 2024

Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, and Mike Zheng Shou. Processpainter: Learn painting process from sequence data.arXiv preprint arXiv:2406.06062, 2024

work page arXiv 2024

[61] [61]

Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

work page arXiv 2025

[62] [62]

Omniconsistency: Learning style-agnostic consistency from paired stylization data.arXiv preprint arXiv:2505.18445, 2025

Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data.arXiv preprint arXiv:2505.18445, 2025

work page arXiv 2025

[63] [63]

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, and Mike Zheng Shou. Omnihumanoid: Streaming cross-embodiment video generation with paired-free adaptation.arXiv preprint arXiv:2605.12038, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[64] [64]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [65]

Diffdecom- pose: Layer-wise decomposition of alpha-composited images via diffusion transformers.arXiv preprint arXiv:2505.21541, 2025

Zitong Wang, Hang Zhao, Qianyu Zhou, Xuequan Lu, Xiangtai Li, and Yiren Song. Diffdecom- pose: Layer-wise decomposition of alpha-composited images via diffusion transformers.arXiv preprint arXiv:2505.21541, 2025

work page arXiv 2025

[66] [66]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

work page 2023

[67] [67]

InsViE-1M: Effective instruction-based video editing with elaborate dataset construction.arXiv preprint arXiv:2503.20287, 2025

Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, and Lei Zhang. InsViE-1M: Effective instruction-based video editing with elaborate dataset construction.arXiv preprint arXiv:2503.20287, 2025

work page arXiv 2025

[68] [68]

X-humanoid: Robotize human videos to generate humanoid videos at scale.arXiv preprint arXiv:2512.04537, 2025

Pei Yang, Hai Ci, Yiren Song, and Mike Zheng Shou. X-humanoid: Robotize human videos to generate humanoid videos at scale.arXiv preprint arXiv:2512.04537, 2025

work page arXiv 2025

[69] [69]

Rerender a video: Zero-shot text-guided video-to-video translation

Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. InSIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023

work page 2023

[70] [70]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[71] [71]

Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024


[72]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024


[73]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, 2025


[74]

VEGGIE: instructional editing and reasoning video concepts with grounded generation

Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: instructional editing and reasoning video concepts with grounded generation. arXiv:2503.14350, 2025


[75]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023


[76]

Ssr-encoder: Encoding selective subject representation for subject-driven generation

Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8069–8078, 2024


[77]

Easycontrol: Adding efficient and flexible control for diffusion transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025

Appendix A Limitations

A primary limitation of our work stems from the composition of VideoEffect-130K. Both th...
