pith. sign in

arxiv: 2605.23458 · v1 · pith:SDVSF2X5new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Pith reviewed 2026-05-25 04:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords one-step video generationautoregressive videoDMD objectiveauxiliary GAN lossVBenchconsistency distillationcausal video models
0
0 comments X

The pith

One-Forcing augments the DMD objective with an auxiliary GAN loss to enable stable one-step autoregressive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces One-Forcing to tackle quality issues in one-step autoregressive video generation. Most few-step methods degrade sharply when reduced to one step, with trajectory distillation yielding weak dynamics and DMD approaches producing blurry frames. One-Forcing combines the DMD objective with an auxiliary GAN loss to produce high-quality videos. On the VBench benchmark it scores 83.76, which is the best among one-step causal methods and close to strong many-step baselines. It further shows that framewise one-step autoregressive generation works stably using only one-third the training cost required for chunkwise models.

Core claim

One-Forcing augments the DMD objective with an auxiliary GAN loss. This addresses the blurriness of prior DMD-based one-step methods and the weak dynamics of trajectory-style consistency distillation. The result is high-quality and efficient one-step video generation that reaches a total VBench score of 83.76 while using significantly less training compute for the framewise autoregressive case.

What carries the argument

One-Forcing, the augmentation of the DMD objective with an auxiliary GAN loss that stabilizes one-step sampling in autoregressive video generators.

If this is right

  • Achieves a VBench total score of 83.76 as state-of-the-art for one-step causal video generation methods.
  • Remains competitive with strong many-step video generation approaches.
  • One-step framewise autoregressive generation becomes stable with one-third the training cost of the chunkwise model.
  • Prior one-step methods failed to achieve stable framewise generation successfully.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduced training cost for framewise models could support deployment in settings where chunkwise processing adds unnecessary overhead.
  • Success with the GAN correction in the one-step regime may indicate that similar auxiliary losses could stabilize other distillation methods.

Load-bearing premise

The auxiliary GAN loss reliably corrects blurriness and weak dynamics without causing instability or new artifacts, and the reported training-cost savings apply beyond the specific experimental setup.

What would settle it

An experiment that removes the auxiliary GAN loss and measures whether the VBench score falls below competitive levels or whether framewise one-step training becomes unstable.

Figures

Figures reproduced from arXiv: 2605.23458 by Cho-Jui Hsieh, Jiaqi Feng, Justin Cui, Yuanhao Ban.

Figure 1
Figure 1. Figure 1: Videos were generated from two prompts using four distinct methods. Wan2.1 teacher [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: One-Forcing training framework. Starting from a one-step causal rollout, One-Forcing optimizes the generated latent distribution with two coupled signals: a DMD gradient from the difference between the trainable fake score and the frozen real score, and an adversarial gradient from a noised-latent discriminator trained against real data. Both signals share the fake-score backbone, so the critic learns deno… view at source ↗
Figure 3
Figure 3. Figure 3: Relative trajectory-curvature profiles show high-noise concentration for Wan video genera [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Discriminator logit gap |lr − lf | during training. One-Forcing (blue) main￾tains a large, varying gap, while ASD (red) stays near zero, confirming a collapsed dis￾criminator. Discriminator effectiveness. We compare the ad￾versarial training dynamics of One-Forcing against ASD [47]. Both methods attach a classification branch to the fake-score backbone operating in noised latent space, but they differ fund… view at source ↗
Figure 5
Figure 5. Figure 5: Training loss curves for One-Forcing (blue) and ASD (red) over the first 100 steps. (a) DMD [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: VBench scores across all 16 dimensions for selected Table 1 entries. Higher radial values [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes One-Forcing, a method that augments the DMD objective with an auxiliary GAN loss to enable stable one-step causal autoregressive video generation. It reports achieving a total score of 83.76 on VBench, claiming state-of-the-art performance among one-step methods while remaining competitive with many-step approaches, and demonstrates stable framewise autoregressive generation at one-third the training cost of chunkwise models.

Significance. If the empirical claims hold after proper validation, the work would advance efficient real-time video generation by addressing blurriness and weak dynamics in prior one-step methods while substantially reducing both inference latency and training cost. The combination of DMD with auxiliary GAN loss is a straightforward extension that could generalize if the stability claims are substantiated.

major comments (1)
  1. [Abstract] Abstract: The central claim of a VBench total score of 83.76 establishing SOTA among one-step causal methods is presented without any description of experimental setup, baselines, statistical significance, ablations, or implementation details. This absence makes the performance claim impossible to assess and is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their feedback. The major comment highlights an important point about the abstract's self-containment. We address it directly below and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a VBench total score of 83.76 establishing SOTA among one-step causal methods is presented without any description of experimental setup, baselines, statistical significance, ablations, or implementation details. This absence makes the performance claim impossible to assess and is load-bearing for the paper's primary contribution.

    Authors: We agree that the abstract, as a standalone summary, should provide enough context to allow readers to assess the central claim without immediately consulting the full text. The manuscript's Sections 3 and 4 already detail the experimental protocol (VBench evaluation on 1K videos, causal framewise autoregressive setting), baselines (Self-Forcing, other DMD variants, chunkwise models, and multi-step methods), ablations on the GAN loss, and implementation (training cost comparison at one-third of chunkwise models). However, to directly address the concern, we will revise the abstract to concisely reference the evaluation setting, primary baselines, and the one-step causal regime. This change strengthens accessibility while preserving the abstract's brevity. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents One-Forcing as an empirical method that augments the DMD objective with an auxiliary GAN loss term. Its central claims consist of benchmark scores on VBench (total 83.76) and a reported training-cost reduction, both obtained through direct experimentation rather than any mathematical derivation or prediction step. No equations, fitted parameters, or self-citations are shown to reduce the reported results to their inputs by construction. The argument structure is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5731 in / 1075 out tokens · 24244 ms · 2026-05-25T04:36:06.408589+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 24 canonical work pages · 13 internal anchors

  1. [1]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/index/ video-generation-models-as-world-simulators/

  2. [2]

    Veo: a text-to-video generation system

    Google DeepMind. Veo: a text-to-video generation system. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/veo/ Veo-3-Tech-Report.pdf

  3. [3]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  4. [4]

    HunyuanVideo: A systematic framework for large video generative models,

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  5. [5]

    URLhttps://arxiv.org/abs/2412.03603

  6. [6]

    Seedance 2.0: Advancing video generation for world complexity,

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, et al. Seedance 2.0: Advancing video generation for world complexity,

  7. [7]

    URLhttps://arxiv.org/abs/2604.14148

  8. [8]

    Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, June 2025. doi: 10.1109/CVPR52734.2025.02138. 10 URL https://openacces...

  9. [9]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neu- ral Information Processing Systems, volume 38, pages 167283–167308. Curran Associates, Inc., 2025. URL ...

  10. [10]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal Forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026. URLhttps://arxiv.org/abs/2602.02214

  11. [11]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, volume 31, pages 2450–2462. Cur- ran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/hash/ 2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

  12. [12]

    Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025. doi: 10.1038/ s41586-025-08744-2. URLhttps://doi.org/10.1038/s41586-025-08744-2

  13. [13]

    Genie 2: A large-scale foundation world model

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Chris- tos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Ann...

  14. [14]

    Astra: General interactive world model with autoregressive denoising

    Yixuan Zhu, Feng Jiaqi, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jiwen Lu, and Jie Zhou. Astra: General interactive world model with autoregressive denoising. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=8UZpmrxoLG

  15. [15]

    Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando De Freitas, Satinder Singh, and Tim Rocktäschel

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elis- abeth Bechtle, Feryal Behbahani, Stephanie C.Y . Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nan...

  16. [16]

    Diffusion models are real- time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real- time game engines. InThe Thirteenth International Conference on Learning Representations,

  17. [17]

    URLhttps://openreview.net/forum?id=P8pqeEkn1H

  18. [18]

    2024 , burl =

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24174– 24184, June 2024. doi: 10.1109/CVPR52733.2024.02282. URL https://openaccess. thecvf.com/content...

  19. [19]

    2024 , burl =

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. 11 InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 6613–6623, June 2024. doi: 10.1109/CVPR52733.2024.00632. URL https://openaccess.the...

  20. [20]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Infor- mation Processing Systems, volume 35, pages 8633–8646. Curran Associates, Inc.,

  21. [21]

    URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 39235c56aef13fb05a6adc95eb9d8d66-Paper-Conference.pdf

  22. [22]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models, 2022. URL https: //arxiv.org/abs/2210.02303

  23. [23]

    CogVideoX: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Conference on Learning Represen...

  24. [24]

    Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

    Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, and Jun Zhu. Causal Forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation, 2026. URL https://arxiv.org/abs/ 2605.15141

  25. [25]

    Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng...

  26. [26]

    LongLive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying-Cong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=nCAODkpsPJ

  27. [27]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=IAyzXjbfwo

  28. [28]

    Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self- rollout, 2025

    Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self- rollout, 2025. URLhttps://arxiv.org/abs/2511.20649. CVPR 2026

  29. [29]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=DzvPiqh23f

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=PqvMRDCJT9t

  31. [31]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st International Conference on Machine Learning, volu...

  32. [32]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...

  33. [33]

    Improved techniques for training consistency models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=WNzy9bRDvG

  34. [34]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LyJi5ugyJx

  35. [35]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference, 2023. URL https://arxiv. org/abs/2310.04378

  36. [36]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS

  37. [37]

    Large scale diffusion distillation via score-regularized continuous-time consistency

    Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=2uNlM353RI

  38. [38]

    Wong, Yu Qiao, and Ziwei Liu

    Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. Dual-expert consistency model for efficient and high-quality video gen- eration. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14983–14993, October 2025. URL https://openaccess.thecvf.com/ content/ICCV2025/html/Lv_Dual-Ex...

  39. [39]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Du- rand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 47455–4748...

  40. [40]

    Transition matching distillation for fast video generation, 2026

    Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation, 2026. URL https://arxiv.org/abs/2601. 09881

  41. [41]

    Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, and Jun Zhang. Salt: Self-consistent distribution matching with cache-aware training for fast video generation, 2026. URLhttps://arxiv.org/abs/2604.03118

  42. [42]

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, and Min Zhang. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation, 2025. URLhttps://arxiv.org/abs/2512.04678

  43. [43]

    Streaming autoregressive video generation via diagonal distillation

    Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=X7YW6STzeL

  44. [44]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors,Advances 13 in Neural Information Processing Systems, volume 27, pages 2672–2680. Curran Asso...

  45. [45]

    Generating videos with scene dynamics

    Carl V ondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. InAdvances in Neural Information Processing Systems, volume 29, pages 613–621. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/ 2016/file/04025959b191f8f9de3f924f0940515f-Paper.pdf

  46. [46]

    MoCoGAN: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, June 2018. doi: 10.1109/CVPR.2018. 00165. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Tulyakov_ MoCoGAN_Decomposing_M...

  47. [47]

    StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3626–3636, June 2022. URL https://openaccess.thecvf.com/content/ CVPR2022/html/Skorokhodov_StyleGAN-V_A_Co...

  48. [48]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InComputer Vision – ECCV 2024, volume 15144 ofLecture Notes in Computer Science, pages 87–103. Springer, 2024. doi: 10.1007/978-3-031-73016-0_6. URL https: //doi.org/10.1007/978-3-031-73016-0_6

  49. [49]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd International Conference on Machine Learning, volume 267...

  50. [50]

    Autoregressive adversarial post-training for real-time interactive video generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 410...

  51. [51]

    Towards one-step causal video generation via adversarial self-distillation

    Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, and Yu Wu. Towards one-step causal video generation via adversarial self-distillation. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=P3O0fNmnWa

  52. [52]

    Phased one-step adversarial equilibrium for video diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 40(5): 3237–3245, March 2026

    Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, and Qinglin Lu. Phased one-step adversarial equilibrium for video diffusion models.Proceedings of the AAAI Conference on Artificial Intelligence, 40(5): 3237–3245, March 2026. doi: 10.1609/aaai.v40i5.37318. URL https://ojs.aaai.org/ index.php/A...

  53. [53]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=XVjTT1nw5z

  54. [54]

    2024 , burl =

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video gener- ative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...

  55. [55]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. SkyReels-V2: Infinite-length film generative model...

  56. [56]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=JE9tCwe3lp

  57. [57]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion, 2024. URLhttps://arxiv.org/abs/2501.00103

  58. [58]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. InThe Thirteenth International Conference on Learning Representations,

  59. [59]

    URLhttps://openreview.net/forum?id=66NzcRQuOq

  60. [60]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Mar- jorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, ...

  61. [61]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling, 2025. URL https://arxiv.org/abs/ 2512.14614

  62. [62]

    Matrix-game: Interactive world foundation model, 2025

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model, 2025. URLhttps://arxiv.org/abs/2506.18701. A Details of Implementations Our implementation is based on the Causal Forcing codebase [8] and the Wan2.1 model family [3]. The re...