One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Cho-Jui Hsieh; Jiaqi Feng; Justin Cui; Yuanhao Ban

arxiv: 2605.23458 · v1 · pith:SDVSF2X5new · submitted 2026-05-22 · 💻 cs.CV · cs.AI

Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh

Pith reviewed 2026-05-25 04:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords one-step video generation · autoregressive video · DMD objective · auxiliary GAN loss · VBench · consistency distillation · causal video models

The pith

One-Forcing augments the DMD objective with an auxiliary GAN loss to enable stable one-step autoregressive video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces One-Forcing to tackle quality issues in one-step autoregressive video generation. Most few-step methods degrade sharply when reduced to one step, with trajectory distillation yielding weak dynamics and DMD approaches producing blurry frames. One-Forcing combines the DMD objective with an auxiliary GAN loss to produce high-quality videos. On the VBench benchmark it scores 83.76, which is the best among one-step causal methods and close to strong many-step baselines. It further shows that framewise one-step autoregressive generation works stably using only one-third the training cost required for chunkwise models.

Core claim

One-Forcing augments the DMD objective with an auxiliary GAN loss. This addresses the blurriness of prior DMD-based one-step methods and the weak dynamics of trajectory-style consistency distillation. The result is high-quality and efficient one-step video generation that reaches a total VBench score of 83.76 while using significantly less training compute for the framewise autoregressive case.

What carries the argument

One-Forcing, the augmentation of the DMD objective with an auxiliary GAN loss that stabilizes one-step sampling in autoregressive video generators.
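
As a rough illustration of how the two training signals could be combined on a generated latent, here is a minimal sketch. It assumes a toy linear discriminator, a non-saturating logistic generator loss, and a scalar weight `gan_weight`; none of these specifics come from the paper, and every function name below is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dmd_gradient(fake_score, real_score):
    """DMD-style gradient on the latent: fake score minus real score."""
    return [f - r for f, r in zip(fake_score, real_score)]


def generator_gan_loss(fake_logit: float) -> float:
    """Assumed non-saturating generator loss: softplus(-D(x))."""
    return softplus(-fake_logit)


def adversarial_gradient(disc_weights, disc_bias, latent):
    """Gradient of softplus(-D(x)) w.r.t. x for a toy linear D(x) = w.x + b."""
    logit = sum(w * x for w, x in zip(disc_weights, latent)) + disc_bias
    scale = -sigmoid(-logit)  # d softplus(-l) / d l
    return [scale * w for w in disc_weights]


def combined_latent_gradient(fake_score, real_score, disc_weights,
                             disc_bias, latent, gan_weight):
    """Elementwise sum of the DMD gradient and the weighted GAN gradient."""
    dmd = dmd_gradient(fake_score, real_score)
    adv = adversarial_gradient(disc_weights, disc_bias, latent)
    return [d + gan_weight * a for d, a in zip(dmd, adv)]
```

Setting `gan_weight` to zero recovers the plain DMD gradient in this sketch, which is one simple way to frame an ablation of the adversarial term.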

If this is right

One-Forcing achieves a VBench total score of 83.76, the state of the art among one-step causal video generation methods.
It remains competitive with strong many-step video generation approaches.
One-step framewise autoregressive generation becomes stable at one-third the training cost of the chunkwise model.
Prior one-step methods had not achieved stable framewise generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduced training cost for framewise models could support deployment in settings where chunkwise processing adds unnecessary overhead.
Success with the GAN correction in the one-step regime may indicate that similar auxiliary losses could stabilize other distillation methods.

Load-bearing premise

The auxiliary GAN loss reliably corrects blurriness and weak dynamics without causing instability or new artifacts, and the reported training-cost savings apply beyond the specific experimental setup.

What would settle it

An experiment that removes the auxiliary GAN loss and measures whether the VBench score falls below competitive levels or whether framewise one-step training becomes unstable.

Figures

Figures reproduced from arXiv: 2605.23458 by Cho-Jui Hsieh, Jiaqi Feng, Justin Cui, Yuanhao Ban.

**Figure 2.** One-Forcing training framework. Starting from a one-step causal rollout, One-Forcing optimizes the generated latent distribution with two coupled signals: a DMD gradient from the difference between the trainable fake score and the frozen real score, and an adversarial gradient from a noised-latent discriminator trained against real data. Both signals share the fake-score backbone, so the critic learns deno…

**Figure 3.** Relative trajectory-curvature profiles show high-noise concentration for Wan video genera…

**Figure 4.** Discriminator logit gap |l_r − l_f| during training. One-Forcing (blue) maintains a large, varying gap, while ASD (red) stays near zero, confirming a collapsed discriminator. Discriminator effectiveness. We compare the adversarial training dynamics of One-Forcing against ASD [47]. Both methods attach a classification branch to the fake-score backbone operating in noised latent space, but they differ fund…

**Figure 5.** Training loss curves for One-Forcing (blue) and ASD (red) over the first 100 steps. (a) DMD…

**Figure 6.** VBench scores across all 16 dimensions for selected Table 1 entries. Higher radial values…
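
The logit gap |l_r − l_f| in the Figure 4 caption can be monitored with a few lines of code. The sketch below assumes that l_r and l_f are batch means of the discriminator logits on real and generated samples, and that "collapse" means the gap stays under a small tolerance; both are illustrative assumptions rather than definitions taken from the paper, and the function names are hypothetical.

```python
from statistics import fmean


def logit_gap(real_logits, fake_logits):
    """Absolute gap between mean real and mean fake discriminator logits."""
    return abs(fmean(real_logits) - fmean(fake_logits))


def looks_collapsed(gap_history, tolerance=1e-2):
    """Heuristic flag: every recorded gap is below the (assumed) tolerance."""
    return len(gap_history) > 0 and all(g < tolerance for g in gap_history)


if __name__ == "__main__":
    # Toy logits for one training step (made-up numbers for illustration).
    gap = logit_gap([2.0, 4.0], [1.0, 1.0])
    print(f"logit gap: {gap:.3f}")
```

Logging this value at every step gives a curve comparable in spirit to the one described in the caption; the tolerance would need tuning for any real setup.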

Abstract

Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration, which still incurs considerable latency during deployment and suffers from severe quality degradation when the number of sampling steps is further reduced, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach which augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve successfully.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

One-Forcing adds an auxiliary GAN loss to DMD to target blurriness and weak dynamics in one-step autoregressive video, reporting 83.76 on VBench, but the abstract gives little experimental detail to back the claims.

The main thing here is that One-Forcing augments DMD with an auxiliary GAN loss to get better one-step autoregressive video generation. The authors claim this reaches 83.76 on VBench for one-step causal methods while staying competitive with many-step approaches, and that it enables stable framewise generation at one-third the training cost of chunkwise setups. The new part is applying this loss combination specifically to push one-step performance without the blurriness seen in Self-Forcing or the weak dynamics in consistency methods. The paper does well at clearly stating the practical problem of latency in few-step models and offering a direct way to address it through the added loss term.

The soft spots center on the evidence presented. The abstract gives the key score and the cost reduction but supplies no information on baselines, ablations for the GAN component, or how the comparisons were run. This makes it tough to judge whether the GAN loss is truly responsible for the gains or if other factors are at play. The claim that the approach avoids instability from the GAN term also rests on the reported outcomes without visible checks in the summary.

This paper would interest people working on real-time video synthesis who need lower latency autoregressive models. Readers focused on distillation techniques might find the idea useful as a practical tweak. It deserves a serious referee because the underlying issue is relevant to current video generation work and the method is simple enough that solid experiments could make it a worthwhile reference point. I would recommend sending it out for peer review so the full experimental details can be evaluated.

Referee Report

1 major / 0 minor

Summary. The paper proposes One-Forcing, a method that augments the DMD objective with an auxiliary GAN loss to enable stable one-step causal autoregressive video generation. It reports achieving a total score of 83.76 on VBench, claiming state-of-the-art performance among one-step methods while remaining competitive with many-step approaches, and demonstrates stable framewise autoregressive generation at one-third the training cost of chunkwise models.

Significance. If the empirical claims hold after proper validation, the work would advance efficient real-time video generation by addressing blurriness and weak dynamics in prior one-step methods while substantially reducing both inference latency and training cost. The combination of DMD with auxiliary GAN loss is a straightforward extension that could generalize if the stability claims are substantiated.

major comments (1)

[Abstract] The central claim of a VBench total score of 83.76 establishing SOTA among one-step causal methods is presented without any description of experimental setup, baselines, statistical significance, ablations, or implementation details. This absence makes the performance claim impossible to assess and is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. The major comment highlights an important point about the abstract's self-containment. We address it directly below and will revise accordingly.

Referee: [Abstract] The central claim of a VBench total score of 83.76 establishing SOTA among one-step causal methods is presented without any description of experimental setup, baselines, statistical significance, ablations, or implementation details. This absence makes the performance claim impossible to assess and is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract, as a standalone summary, should provide enough context to allow readers to assess the central claim without immediately consulting the full text. The manuscript's Sections 3 and 4 already detail the experimental protocol (VBench evaluation on 1K videos, causal framewise autoregressive setting), baselines (Self-Forcing, other DMD variants, chunkwise models, and multi-step methods), ablations on the GAN loss, and implementation (training cost comparison at one-third of chunkwise models). However, to directly address the concern, we will revise the abstract to concisely reference the evaluation setting, primary baselines, and the one-step causal regime. This change strengthens accessibility while preserving the abstract's brevity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents One-Forcing as an empirical method that augments the DMD objective with an auxiliary GAN loss term. Its central claims consist of benchmark scores on VBench (total 83.76) and a reported training-cost reduction, both obtained through direct experimentation rather than any mathematical derivation or prediction step. No equations, fitted parameters, or self-citations are shown to reduce the reported results to their inputs by construction. The argument structure is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: augments the DMD objective with an auxiliary GAN loss... shared fake-score transformer backbone

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Wan video trajectories concentrate 92.5% of their curvature mass at t≥0.9

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 24 canonical work pages · 13 internal anchors

[1]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/

2024
[2]

Veo: a text-to-video generation system

Google DeepMind. Veo: a text-to-video generation system. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

2025
[3]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

2025
[4]

HunyuanVideo: A systematic framework for large video generative models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...
[5]

URL https://arxiv.org/abs/2412.03603
[6]

Seedance 2.0: Advancing video generation for world complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, et al. Seedance 2.0: Advancing video generation for world complexity,
[7]

URL https://arxiv.org/abs/2604.14148
[8]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, June 2025. doi: 10.1109/CVPR52734.2025.02138. URL https://openacces...

2025
[9]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 167283–167308. Curran Associates, Inc., 2025. URL ...

2025
[10]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal Forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026. URL https://arxiv.org/abs/2602.02214

2026
[11]

Recurrent world models facilitate policy evolution

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, volume 31, pages 2450–2462. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html

2018
[12]

Mastering diverse control tasks through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 640(8059):647–653, 2025. doi: 10.1038/s41586-025-08744-2. URL https://doi.org/10.1038/s41586-025-08744-2

2025
[13]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Ann...

2024
[14]

Astra: General interactive world model with autoregressive denoising

Yixuan Zhu, Feng Jiaqi, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jiwen Lu, and Jie Zhou. Astra: General interactive world model with autoregressive denoising. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8UZpmrxoLG

2026
[15]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y. Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nan...

2024
[16]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In The Thirteenth International Conference on Learning Representations,
[17]

URL https://openreview.net/forum?id=P8pqeEkn1H
[18]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24174–24184, June 2024. doi: 10.1109/CVPR52733.2024.02282. URL https://openaccess.thecvf.com/content...

2024
[19]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6613–6623, June 2024. doi: 10.1109/CVPR52733.2024.00632. URL https://openaccess.the...

2024
[20]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 8633–8646. Curran Associates, Inc.,
[21]

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/39235c56aef13fb05a6adc95eb9d8d66-Paper-Conference.pdf

2022
[22]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models, 2022. URL https://arxiv.org/abs/2210.02303

2022
[23]

CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Represen...

2025
[24]

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, and Jun Zhu. Causal Forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation, 2026. URL https://arxiv.org/abs/2605.15141

2026
[25]

Sand.ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shucheng...

2025
[26]

LongLive: Real-time interactive long video generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying-Cong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nCAODkpsPJ

2026
[27]

Rolling forcing: Autoregressive long video diffusion in real time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IAyzXjbfwo

2026
[28]

Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self-rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self-rollout, 2025. URL https://arxiv.org/abs/2511.20649. CVPR 2026

2025
[29]

Self-forcing++: Towards minute-scale high-quality video generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DzvPiqh23f

2026
[30]

Flow matching for generative modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

2023
[31]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, volu...

2024
[32]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211–32252. PMLR, 23–29 Jul 2...

2023
[33]

Improved techniques for training consistency models

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WNzy9bRDvG

2024
[34]

Simplifying, stabilizing and scaling continuous-time consistency models

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LyJi5ugyJx

2025
[35]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference, 2023. URL https://arxiv.org/abs/2310.04378

2023
[36]

One step diffusion via shortcut models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS

2025
[37]

Large scale diffusion distillation via score-regularized continuous-time consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=2uNlM353RI

2026
[38]

Dual-expert consistency model for efficient and high-quality video generation

Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. Dual-expert consistency model for efficient and high-quality video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14983–14993, October 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/Lv_Dual-Ex...

2025
[39]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 47455–4748...

2024
[40]

Transition matching distillation for fast video generation

Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation, 2026. URL https://arxiv.org/abs/2601.09881

2026
[41]

Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, and Jun Zhang. Salt: Self-consistent distribution matching with cache-aware training for fast video generation, 2026. URL https://arxiv.org/abs/2604.03118

2026
[42]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, and Min Zhang. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation, 2025. URL https://arxiv.org/abs/2512.04678

2025
[43]

Streaming autoregressive video generation via diagonal distillation

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=X7YW6STzeL

2026
[44]

Generative adversarial nets

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27, pages 2672–2680. Curran Asso...

2014
[45]

Generating videos with scene dynamics

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, volume 29, pages 613–621. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/04025959b191f8f9de3f924f0940515f-Paper.pdf

2016
[46]

MoCoGAN: Decomposing motion and content for video generation

Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, June 2018. doi: 10.1109/CVPR.2018.00165. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Tulyakov_MoCoGAN_Decomposing_M...

2018
[47]

StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2

Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3626–3636, June 2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/Skorokhodov_StyleGAN-V_A_Co...

2022
[48]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In Computer Vision – ECCV 2024, volume 15144 of Lecture Notes in Computer Science, pages 87–103. Springer, 2024. doi: 10.1007/978-3-031-73016-0_6. URL https://doi.org/10.1007/978-3-031-73016-0_6

2024
[49]

Diffusion adversarial post-training for one-step video generation

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267...

2025
[50]

Autoregressive adversarial post-training for real-time interactive video generation

Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 410...

2025
[51]

Towards one-step causal video generation via adversarial self-distillation

Yongqi Yang, Huayang Huang, Xu Peng, Xiaobin Hu, Donghao Luo, Jiangning Zhang, Chengjie Wang, and Yu Wu. Towards one-step causal video generation via adversarial self-distillation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=P3O0fNmnWa

2026
[52]

Phased one-step adversarial equilibrium for video diffusion models

Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, and Qinglin Lu. Phased one-step adversarial equilibrium for video diffusion models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(5):3237–3245, March 2026. doi: 10.1609/aaai.v40i5.37318. URL https://ojs.aaai.org/index.php/A...

2026
[53]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z

2023
[54]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni...

2024
[55]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. SkyReels-V2: Infinite-length film generative model...

2025
[56]

Autoregressive video generation without vector quantization

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=JE9tCwe3lp

2025
[57]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion, 2024. URL https://arxiv.org/abs/2501.00103

2024
[58]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=66NzcRQuOq
[60]

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, ...

2025
[61]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling, 2025. URL https://arxiv.org/abs/2512.14614

2025
[62]

Matrix-game: Interactive world foundation model

Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model, 2025. URL https://arxiv.org/abs/2506.18701

2025

A Details of Implementations

Our implementation is based on the Causal Forcing codebase [8] and the Wan2.1 model family [3]. The re...
