
arxiv: 2606.25473 · v1 · pith:RQF7FHNGnew · submitted 2026-06-24 · 💻 cs.CV · cs.LG

Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

Pith reviewed 2026-06-25 20:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG
keywords autoregressive video diffusion · diffusion distillation · consistency models · streaming video generation · teacher-forcing · self-forcing · causal training · interactive world models

The pith

Causal-rCM combines teacher-forcing consistency models as initialization with self-forcing DMD refinement to distill autoregressive video diffusion models that generate high-quality streaming video in one or two steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the rCM distillation approach to autoregressive video diffusion by treating teacher-forcing as a forward-divergence causal training method and self-forcing as a reverse-divergence on-policy refinement. Experiments establish that teacher-forcing CM supplies the strongest initialization for subsequent self-forcing DMD. The work also supplies the first continuous-time CM implementation for this setting via a custom JVP kernel that yields tenfold faster convergence than discrete-time variants. The resulting Causal-rCM recipe produces a 2-step model reaching 84.63 VBench-T2V on Wan2.1-1.3B and transfers to action-conditioned world models.

Core claim

The core philosophy of complementarity between forward and reverse divergences carries over directly to the autoregressive setting, so that teacher-forcing CM serves as the best initialization complement to self-forcing DMD; this pairing, together with continuous-time CMs enabled by a custom-mask FlashAttention-2 JVP kernel, yields state-of-the-art streaming video generation in both frame-wise and chunk-wise regimes using only synthetic data and supports interactive world models on Cosmos 3.

What carries the argument

Causal-rCM, the unified recipe that pairs teacher-forcing continuous-time consistency models with self-forcing distribution matching distillation for causal autoregressive diffusion training.
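
As a rough illustration of the two causal training paradigms named above, the following Python sketch contrasts how the conditioning context is built in each. It is a toy under stated assumptions, not the paper's training code: `toy_student`, the scalar "blocks", and the mixing coefficients are all hypothetical.

```python
import random

def toy_student(noise, context):
    """Hypothetical one-step generator: mixes noise with the mean of the context."""
    anchor = sum(context) / len(context) if context else 0.0
    return 0.5 * anchor + 0.1 * noise

def teacher_forcing_contexts(clean_blocks):
    """Teacher-forcing: block i is conditioned on the ground-truth blocks before it."""
    return [clean_blocks[:i] for i in range(len(clean_blocks))]

def self_forcing_rollout(num_blocks, rng):
    """Self-forcing: block i is conditioned on the student's own earlier outputs."""
    generated = []
    for _ in range(num_blocks):
        generated.append(toy_student(rng.gauss(0.0, 1.0), list(generated)))
    return generated

rng = random.Random(0)
clean = [1.0, 2.0, 3.0, 4.0]          # stand-in for clean data blocks
tf_contexts = teacher_forcing_contexts(clean)
sf_blocks = self_forcing_rollout(len(clean), rng)

print("teacher-forcing context for block 3:", tf_contexts[3])
print("self-forcing rollout:", sf_blocks)
```

The teacher-forcing contexts can be built offline from data, whereas the self-forcing rollout depends on the current student, which is why the summary describes it as on-policy.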

If this is right

  • Teacher-forcing CM initialization is currently the strongest complement to self-forcing DMD.
  • Continuous-time CMs converge ten times faster than discrete-time CMs under the custom JVP kernel.
  • The distilled 2-step Wan2.1-1.3B model attains 84.63 VBench-T2V with one or two sampling steps.
  • The same recipe reaches state-of-the-art results in both frame-wise and chunk-wise streaming settings on synthetic data alone.
  • Causal-rCM transfers to action-conditioned generation inside the Cosmos 3 omnimodal world model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same teacher-forcing plus self-forcing pairing could be tested on other autoregressive sequence domains such as audio or text.
  • One- or two-step sampling opens the possibility of real-time interactive video world models on consumer hardware.
  • Training entirely on synthetic data suggests that the method may scale without large curated video corpora.
  • The custom JVP kernel technique may generalize to accelerate consistency training in other transformer architectures.

Load-bearing premise

The complementarity between forward and reverse divergences that worked in ordinary diffusion distillation also holds when the model must generate video autoregressively with causal attention.
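
To make the forward/reverse distinction concrete, here is a minimal Python sketch of the generic behavior of the two KL directions on a hypothetical 4-state distribution. It does not implement CM or DMD; the distributions `target`, `covering`, and `one_mode` are invented for illustration only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical 4-state target with two modes (states 0 and 3).
target = [0.49, 0.01, 0.01, 0.49]
covering = [0.25, 0.25, 0.25, 0.25]   # spreads mass over both modes
one_mode = [0.97, 0.01, 0.01, 0.01]   # commits to a single mode

forward = {"covering": kl(target, covering), "one_mode": kl(target, one_mode)}
reverse = {"covering": kl(covering, target), "one_mode": kl(one_mode, target)}

print("forward KL(target || student):", forward)
print("reverse KL(student || target):", reverse)
```

In this toy case the forward direction scores the mode-covering student better, while the reverse direction scores the single-mode student better. That asymmetry is the usual intuition for why a reverse-KL objective can collapse modes and may benefit from a forward-type initialization.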

What would settle it

A controlled run in which replacing teacher-forcing CM initialization with random or standard initialization produces equal or better final self-forcing DMD performance on the same autoregressive video backbone would falsify the claimed complementarity.

Figures

Figures reproduced from arXiv: 2606.25473 by Chen-Hsuan Lin, Guande He, Huayu Chen, Jianfei Chen, Jintao Zhang, Jun Zhu, Kaiwen Zheng, Ming-Yu Liu, Min Zhao, Qianli Ma.

Figure 1: State-of-the-art performance of Causal-rCM for streaming video generation (1-step: 84.63). Causal-rCM achieves leading VBench-T2V scores across 1-step, 2-step, and 4-step generation, under both frame-wise and chunk-wise autoregressive regimes. © 2026 NVIDIA. All rights reserved.
Figure 2: A unified divergence perspective of rCM (Zheng et al., 2025) and Causal-rCM. However, self-forcing with DMD or GAN objectives is sensitive to initialization and suffers from mode collapse, as DMD-style objectives are based on reverse-KL divergence and optimize student-generated rollouts. Existing AR diffusion systems therefore introduce different initialization strategies before self-forcing, such as ODE-p…
Figure 3: Illustration of causal training paradigms, adapted from Self-Forcing (Huang et al., 2025). Autoregressive (AR) video diffusion factorizes video generation along the temporal dimension. Given a video latent sequence $x_0 = [x_0^1, \dots, x_0^N]$ divided into frames or chunks, an AR model generates each block conditioned on previous blocks: $p_\theta(x_0) = \prod_{i=1}^{N} p_\theta(x_0^i \mid x_0^{<i})$. Within each temporal block, the…
Figure 4: Comparison between Causal-rCM and other approaches. To extend rCM to autoregressive diffusion, we pair its two distillation objectives (CM, DMD) with two causal training paradigms, teacher-forcing (TF) and self-forcing (SF), respectively. This preserves the forward-reverse correspondence of rCM in the autoregressive setting: TF-CM provides an offline, forward-type consistency objective, whereas SF-DMD prov…
Figure 5: Adaptation to acceleration techniques: noisy context and custom step schedule. Noisy context and custom step schedules (Liu et al., 2026) are two of the simplest and most effective inference acceleration techniques for AR video diffusion distillation. Both TF and SF can naturally incorporate them, as illustrated in…
Figure 6: Training curves of TF-dCM and TF-sCM.
Figure 7: SF-DMD training curves with different initialization strategies. In the frame-wise setting, TF-CM initialization achieves the best overall performance, with DF and TF-KD also providing competitive alternatives. Although TF-sCM starts from a stronger initial model, TF-dCM is more stable during SF-DMD and supports longer refinement, leading to a higher peak score. In the chunk-wise setting, DF/TF initializat…
Figure 8: Visualizations of chunk-wise SF-DMD under different initialization strategies. DF/TF initialization leads to higher VBench-T2V scores while suffering from overly smooth textures and lacking fine-grained details.
Figure 9: From Cosmos 3 to interactive Cosmos 3. Cosmos 3 uses causal self-attention for UND tokens, full cross-attention from GEN to UND tokens, and bidirectional self-attention within GEN tokens. Interactive Cosmos 3 preserves the UND-GEN attention structure but replaces GEN self-attention with temporal-causal attention over latent-frame supertokens. In the forward-dynamics layout, $V_i$ denotes a vision supertoken, …
Figure 10: Cosmos 3 interactive generation on autonomous-driving scenarios conditioned on the action of the vehicle ego-motion. JVP computation compatible with FlashAttention, FSDP, and context parallelism, and combines it with DMD regularization (Zheng et al., 2025). Causal-rCM extends this line to autoregressive video diffusion, applying JVP-based teacher-forcing sCM under clean causal contexts as a structured ini…
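
The autoregressive factorization quoted in the Figure 3 caption, and the temporal-causal attention described for Figure 9, can be pictured as a block-causal attention mask. The Python sketch below is an assumption-laden illustration rather than the paper's kernel: it assumes tokens inside a block attend to each other freely (the Figure 3 caption is truncated at that point), and `block_causal_mask` and its parameters are hypothetical names.

```python
def block_causal_mask(num_blocks, tokens_per_block):
    """mask[q][k] is True when query token q may attend to key token k.

    A query may see every token in its own block and in all earlier blocks,
    but nothing from later blocks.
    """
    n = num_blocks * tokens_per_block
    return [
        [(k // tokens_per_block) <= (q // tokens_per_block) for k in range(n)]
        for q in range(n)
    ]

mask = block_causal_mask(num_blocks=3, tokens_per_block=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `tokens_per_block` to the token count of one latent frame would correspond to a frame-wise regime, and a multiple of it to a chunk-wise regime, under the same assumption.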
Original abstract

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM lies in the complementarity between forward and reverse divergences, represented by consistency models (CMs) and distribution matching distillation (DMD), respectively, in diffusion distillation. This philosophy naturally carries over to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10$\times$ faster convergence compared to discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; and (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
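
The abstract ties continuous-time CMs to a JVP kernel. A plausible reading is that such objectives need the total time derivative of the network output along the noising trajectory, which is a Jacobian-vector product with tangent (dx_t/dt, 1). The Python sketch below illustrates only that idea with a tiny forward-mode `Dual` class; it is not the paper's FlashAttention-2 kernel. The scalar `toy_net` and the linear interpolation path `x_t = (1 - t) * x0 + t * noise` are assumptions chosen for illustration.

```python
import math

class Dual:
    """Minimal forward-mode AD value: a primal and its tangent."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    @staticmethod
    def lift(value):
        return value if isinstance(value, Dual) else Dual(value)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__

def dual_tanh(v):
    y = math.tanh(v.primal)
    return Dual(y, (1.0 - y * y) * v.tangent)

def toy_net(x, t):
    """Hypothetical scalar stand-in for a network f(x_t, t)."""
    return dual_tanh(0.7 * x + 0.3 * t) * x + t * t

def total_time_derivative(x0, noise, t):
    """d/dt f(x_t, t) via one JVP with tangent (dx_t/dt, 1)."""
    x_t = (1.0 - t) * x0 + t * noise
    velocity = noise - x0                      # dx_t/dt for the assumed path
    return toy_net(Dual(x_t, velocity), Dual(t, 1.0)).tangent

def finite_difference(x0, noise, t, eps=1e-6):
    """Central-difference check of the same derivative."""
    def f(s):
        x_s = (1.0 - s) * x0 + s * noise
        return toy_net(Dual(x_s), Dual(s)).primal
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)

jvp_value = total_time_derivative(0.5, -1.2, 0.3)
fd_value = finite_difference(0.5, -1.2, 0.3)
print(jvp_value, fd_value)
```

A discrete-time variant would instead approximate this derivative from two nearby timesteps, much like `finite_difference` does. The single-pass tangent is what an attention-level JVP kernel would have to propagate through a transformer.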

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends the rCM diffusion distillation framework to autoregressive video diffusion, arguing that the complementarity between teacher-forcing consistency models (forward divergence) and self-forcing DMD (reverse divergence) naturally carries over to the causal setting. It introduces Causal-rCM as a unified recipe, implements continuous-time CMs with a custom-mask FlashAttention-2 JVP kernel for 10× faster convergence than discrete-time CMs, and reports state-of-the-art streaming results using only synthetic training data, including a VBench-T2V score of 84.63 with 1 or 2 sampling steps from a distilled 2-step causal Wan2.1-1.3B model. The recipe is also applied to Cosmos 3 to obtain an interactive world model.

Significance. If the results hold, this provides a significant advance in efficient streaming video generation and action-conditioned world models by offering an open, scalable distillation method that achieves high performance with very few sampling steps. The emphasis on an open recipe is a strength for reproducibility in the field.

major comments (2)
  1. [Abstract] The foundational claim that the rCM complementarity 'naturally carries over' to the autoregressive setting (with TF CM as best initialization for SF DMD) is presented without analysis of how causal masking, frame-wise conditioning, or self-forcing's sequential error accumulation affect the forward-reverse synergy or exposure bias; this assumption underpins all listed contributions and requires explicit justification beyond empirical assertion.
  2. [Abstract] Performance claims including 10× faster convergence for continuous-time CMs and the VBench-T2V score of 84.63 are reported without reference to exact baselines, data splits, number of runs, or statistical tests, preventing verification of the SOTA and convergence assertions.
minor comments (1)
  1. [Abstract] The phrase 'currently the best complement' is used without citing the specific ablation or comparison that establishes this ranking among possible initializations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the justification and clarity of our claims. We address each major comment below and will incorporate revisions in the next version of the paper.

  1. Referee: [Abstract] The foundational claim that the rCM complementarity 'naturally carries over' to the autoregressive setting (with TF CM as best initialization for SF DMD) is presented without analysis of how causal masking, frame-wise conditioning, or self-forcing's sequential error accumulation affect the forward-reverse synergy or exposure bias; this assumption underpins all listed contributions and requires explicit justification beyond empirical assertion.

    Authors: Our claim rests on the empirical demonstration through extensive experiments that teacher-forcing CM provides the strongest initialization for self-forcing DMD. We acknowledge that a more explicit analysis of causal masking, frame-wise conditioning, and exposure bias effects on the forward-reverse synergy would improve rigor. In the revised manuscript, we will add a dedicated discussion paragraph (with supporting ablation observations) addressing how these autoregressive factors preserve the complementarity.
    Revision: yes

  2. Referee: [Abstract] Performance claims including 10× faster convergence for continuous-time CMs and the VBench-T2V score of 84.63 are reported without reference to exact baselines, data splits, number of runs, or statistical tests, preventing verification of the SOTA and convergence assertions.

    Authors: We agree the abstract is too concise on these points. The body of the manuscript specifies the baselines (discrete-time CMs and prior distillation approaches), evaluation protocol on VBench-T2V, and synthetic data sources. We will revise the abstract to reference these details and note that convergence comparisons were obtained over repeated training runs. Full tables with any variance measures appear in the experiments section.
    Revision: yes

Circularity Check

0 steps flagged

Empirical engineering paper with no derivation reducing to inputs


The paper reports an extension of rCM to autoregressive video diffusion via teacher-forcing CM and self-forcing DMD, with all listed contributions (implementation of continuous-time CMs, Causal-rCM recipe, SOTA scores on VBench-T2V) resting on experimental outcomes and custom infrastructure rather than any closed-form derivation. The statement that rCM complementarity 'naturally carries over' is presented as a premise tested by 'extensive experiments' identifying TF CM as best initialization; no equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the central claims equivalent to their inputs by construction. The work is self-contained against external benchmarks such as VBench scores and is therefore scored at the default non-circularity level.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the rCM forward-reverse complementarity transfers to causal autoregressive video diffusion and on standard diffusion training assumptions; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: The complementarity between forward (consistency-model) and reverse (distribution-matching) divergences in rCM naturally carries over to the autoregressive video setting.
    Explicitly stated as the core philosophy that enables all four listed contributions.



Reference graph

Works this paper leans on

89 extracted references · 34 linked inside Pith

  1. [1]

    World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025.

  2. [2]

    Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

    Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025.

  3. [3]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators, 3, 2024.

  5. [5]

    Mode seeking meets mean seeking for fast long video generation. arXiv preprint arXiv:2602.24289, 2026

    Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, et al. Mode seeking meets mean seeking for fast long video generation. arXiv preprint arXiv:2602.24289, 2026.

  6. [6]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024.

  7. [7]

    Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.

  8. [8]

    pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974, 2025

    Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, and Sai Bi. pi-flow: Policy-based few-step generation via imitation distillation. arXiv preprint arXiv:2510.14974, 2025.

  9. [9]

    Longlive2.0: An nvfp4 parallel infrastructure for long video generation. arXiv preprint arXiv,

    Yukang Chen, Luozhou Wang, Wei Huang, Shuai Yang, Bohan Zhang, Yicheng Xiao, Ruihang Chu, Weian Mao, Qixin Hu, Shaoteng Liu, Yuyang Zhao, Huizi Mao, Ying-Cong Chen, Enze Xie, Xiaojuan Qi, and Song Han. Longlive2.0: An nvfp4 parallel infrastructure for long video generation. arXiv preprint arXiv,

  10. [10]

    Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150, 2025

    Zhenglin Cheng, Peng Sun, Jianguo Li, and Tao Lin. Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150, 2025.

  11. [11]

    Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2(3):4, 2024

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2(3):4, 2024.

  12. [12]

    One-forcing: Towards stable one-step autoregressive video generation. arXiv preprint arXiv:2605.23458, 2026

    Jiaqi Feng, Justin Cui, Yuanhao Ban, and Cho-Jui Hsieh. One-forcing: Towards stable one-step autoregressive video generation. arXiv preprint arXiv:2605.23458, 2026.

  13. [13]

    Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661, 2025

    Yao Feng, Chendong Xiang, Xinyi Mao, Hengkai Tan, Zuyue Zhang, Shuhe Huang, Kaiwen Zheng, Haitian Liu, Hang Su, and Jun Zhu. Vidarc: Embodied video diffusion model for closed-loop control. arXiv preprint arXiv:2512.17661, 2025.

  14. [14]

    Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

  15. [15]

    Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.

  16. [16]

    Anyflow: Any-step video diffusion model with on-policy flow map distillation. arXiv preprint arXiv:2605.13724,

    Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. Anyflow: Any-step video diffusion model with on-policy flow map distillation. arXiv preprint arXiv:2605.13724,

  17. [17]

    FastVideo: A unified inference and post-training framework for accelerated video generation,

    Hao-AI Lab. FastVideo: A unified inference and post-training framework for accelerated video generation,

  18. [18]

    URL https://github.com/hao-ai-lab/FastVideo.

  19. [19]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025.

  20. [20]

    Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  21. [21]

    Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, and Hao Tan. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025.

  22. [22]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.

  23. [23]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

  24. [24]

    Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025

    Yubo Huang, Hailong Guo, Fangtai Wu, Weiqiang Wang, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, and Steven Hoi. Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025.

  25. [25]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

  26. [26]

    Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025

    Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025.

  27. [27]

    Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.

  28. [28]

    Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025

    Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Changsheng Lu, Zhen Li, et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025.

  29. [29]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. In International Conference on Learning Representations, volume 2025, pages 23378–23402, 2025.

  30. [30]

    Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023

    Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.

  31. [31]

    Memrope: Training-free infinite video generation via evolving memory tokens. arXiv preprint arXiv:2603.12513, 2026

    Youngrae Kim, Qixin Hu, C-C Jay Kuo, and Peter A Beerel. Memrope: Training-free infinite video generation via evolving memory tokens. arXiv preprint arXiv:2603.12513, 2026.

  32. [32]

    Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  33. [33]

    Aad-1: Asymmetric adversarial distillation for one-step autoregressive video generation. arXiv preprint arXiv:2606.03972, 2026

    Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, and Zhipeng Zhang. Aad-1: Asymmetric adversarial distillation for one-step autoregressive video generation. arXiv preprint arXiv:2606.03972, 2026.

  34. [34]

    Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion. arXiv preprint arXiv:2602.07775, 2026

    Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion. arXiv preprint arXiv:2602.07775, 2026.

  35. [35]

    Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

  36. [36]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 37959–37974. PMLR, 2025.

  37. [37]

    Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350, 2025.

  38. [38]

    Continuous adversarial flow models

    Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, and Haoqi Fan. Continuous adversarial flow models. arXiv preprint arXiv:2604.11521, 2026.

  39. [39]

    Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  40. [40]

    Streaming autoregressive video generation via diagonal distillation

    Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, and Weiyang Liu. Streaming autoregressive video generation via diagonal distillation. In ICLR, 2026.

  41. [41]

    Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  42. [42]

    Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.

  43. [43]

    Maximum likelihood training for score-based diffusion odes by high order denoising score matching

    Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International conference on machine learning, pages 14429–14460. PMLR, 2022.

  44. [44]

    Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021

    Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

  45. [45]

    Nvidia fastgen: Fast generation from diffusion models, 2026

    Weili Nie, Julius Berner, Chao Liu, and Arash Vahdat. Nvidia fastgen: Fast generation from diffusion models, 2026. URL https://github.com/NVlabs/FastGen.

  46. [46]

    Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026

    Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026.

  47. [47]

    Elucidating the exposure bias in diffusion models

    Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, and Itir Onal Ertugrul. Elucidating the exposure bias in diffusion models. In International Conference on Learning Representations, volume 2024, pages 15167–15189, 2024. 2

  48. [48]

    Cosmos 3: Omnimodal world models for physical ai. arXiv preprint arXiv:2606.02800, 2026

    NVIDIA. Cosmos 3: Omnimodal world models for physical ai. arXiv preprint arXiv:2606.02800, 2026. URL https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf. 2, 16

  49. [49]

    Eflow: Fast few-step video generator training from scratch via efficient solution flow. arXiv preprint arXiv:2603.27086, 2026

    Dogyun Park, Yanyu Li, Sergey Tulyakov, and Anil Kag. Eflow: Fast few-step video generator training from scratch via efficient solution flow. arXiv preprint arXiv:2603.27086, 2026. 19

  50. [50]

    Facm: Flow-anchored consistency models. arXiv preprint arXiv:2507.03738, 2025

    Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, and Feng Wu. Facm: Flow-anchored consistency models. arXiv preprint arXiv:2507.03738, 2025. 17

  51. [51]

    Few-step distillation for text-to-image generation: A practical guide. arXiv preprint arXiv:2512.13006, 2025

    Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang, Bohan Zhuang, and Gao Huang. Few-step distillation for text-to-image generation: A practical guide. arXiv preprint arXiv:2512.13006, 2025. 18

  52. [52]

    Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models. arXiv preprint arXiv:26...

  53. [53]

    Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603, 2025

    Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603, 2025. 17

  54. [54]

    Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024. 2

  55. [55]

    Generalization in generation: A closer look at exposure bias

    Florian Schmidt. Generalization in generation: A closer look at exposure bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 157–167, 2019. 2

  56. [56]

    Seedance 2.0: Advancing video generation for world complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026. 2

  57. [57]

    Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K Titsias. Simplified and generalized masked diffusion for discrete data. arXiv preprint arXiv:2406.04329, 2024. 2

  58. [58]

    Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 4

  59. [59]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023. 5

  60. [60]

    Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025. 2

  61. [61]

    Flow map distillation without data. arXiv preprint arXiv:2511.19428, 2025

    Shangyuan Tong, Nanye Ma, Saining Xie, and Tommi Jaakkola. Flow map distillation without data. arXiv preprint arXiv:2511.19428, 2025. 18

  62. [62]

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  63. [63]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023. 6

  64. [64]

    Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139, 2026

    Tianhe Wu, Ruibin Li, Lei Zhang, and Kede Ma. Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139, 2026. 18

  65. [65]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. In ICLR, 2026. 2, 14

  66. [66]

    Data-regularized reinforcement learning for diffusion models at scale

    Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, et al. Data-regularized reinforcement learning for diffusion models at scale. arXiv preprint arXiv:2512.04332, 2025. 4

  67. [67]

    World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026. 2

  68. [68]

    Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout

    Hidir Yesiltepe, Tuna Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40256–40265, 2026. 12

  69. [69]

    Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025. 12

  70. [70]

    Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024. 2, 6, 9

  71. [71]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 2, 6

  72. [72]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025. 3

  73. [73]

    Adaptive video distillation: Mitigating oversaturation and temporal collapse in few-step generation. arXiv preprint arXiv:2603.21864,

    Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, and Peng Jiang. Adaptive video distillation: Mitigating oversaturation and temporal collapse in few-step generation. arXiv preprint arXiv:2603.21864,

  74. [74]

    Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training. https://github.com/SandAI-org/MagiAttention/,

    Tao Zewei and Huang Yunpeng. Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training. https://github.com/SandAI-org/MagiAttention/,

  75. [75]

    8, 30

  76. [76]

    Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention. arXiv preprint arXiv:2509.24006, 2025

    Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, et al. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention. arXiv preprint arXiv:2509.24006, 2025. 19

  77. [77]

    Turbodiffusion: Accelerating video diffusion models by 100-200 times. arXiv preprint arXiv:2512.16093, 2025

    Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E Gonzalez, Jianfei Chen, and Jun Zhu. Turbodiffusion: Accelerating video diffusion models by 100-200 times. arXiv preprint arXiv:2512.16093, 2025. 19

  78. [78]

    Sla2: Sparse-linear attention with learnable routing and qat. arXiv preprint arXiv:2602.12675, 2026

    Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, and Joseph E Gonzalez. Sla2: Sparse-linear attention with learnable routing and qat. arXiv preprint arXiv:2602.12675, 2026. 19

  79. [79]

    Causal forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation. arXiv preprint arXiv:2605.15141, 2026

    Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, and Jun Zhu. Causal forcing++: Scalable few-step autoregressive diffusion distillation for real-time interactive video generation. arXiv preprint arXiv:2605.15141, 2026. 18, 19

  80. [80]

    Pytorch fsdp: experiences on scaling fully sharded data parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. 11

Showing first 80 references.