arxiv: 2602.02214 · v2 · pith:77QS24PVnew · submitted 2026-02-02 · 💻 cs.CV

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Pith reviewed 2026-05-21 17:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive diffusion distillation · video generation · causal attention · ODE initialization · real-time video · diffusion models · interactive generation · model distillation

The pith

Autoregressive teachers enable correct ODE initialization for distilling high-quality real-time interactive video generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that distilling few-step autoregressive video models from pretrained bidirectional diffusion teachers creates an architectural mismatch because bidirectional models lack the frame-level injectivity required for ODE initialization to recover the teacher's flow map. Instead the process collapses to a conditional-expectation solution that harms output quality. By switching to an autoregressive teacher for the initialization step, the proposed method restores the correct flow and produces better video results. A reader would care because this fixes a core obstacle to practical real-time interactive video generation from diffusion models.

Core claim

Distilling an autoregressive student from a bidirectional teacher violates frame-level injectivity under the teacher's probability flow ODE, so ODE initialization recovers only a conditional expectation rather than the teacher's flow map; an autoregressive teacher satisfies injectivity and therefore allows the initialization to recover the flow map, bridging the causal-attention gap and yielding higher-quality few-step generation.
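
The collapse to a conditional expectation can be illustrated with a deliberately tiny toy, sketched here under stated assumptions rather than taken from the paper. The names `toy_bidirectional_teacher`, `toy_causal_teacher`, and `fit_mse_lookup` are hypothetical, the "frames" are single numbers, and the hidden variable `h` is only a stand-in for information the student cannot see (for example, noise in frames it may not attend to). The student is fit by squared error, which is assumed here as the regression loss.

```python
from collections import defaultdict


def fit_mse_lookup(pairs):
    """Fit a lookup-table regressor under squared error.

    For each distinct input, the squared-error minimiser is the mean of
    the targets paired with that input.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in pairs:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def toy_bidirectional_teacher(z, h):
    # Hypothetical teacher: the clean value depends on the visible input z
    # AND on h, which the student never observes.
    return z + h


def toy_causal_teacher(z):
    # Hypothetical teacher: the clean value depends only on what the
    # student also sees, so each input has exactly one target.
    return 2.0 * z


visible_inputs = [0.0, 1.0]
hidden_values = [-1.0, 1.0]

bi_pairs = [(z, toy_bidirectional_teacher(z, h))
            for z in visible_inputs for h in hidden_values]
ar_pairs = [(z, toy_causal_teacher(z)) for z in visible_inputs]

student_from_bi = fit_mse_lookup(bi_pairs)
student_from_ar = fit_mse_lookup(ar_pairs)

for z in visible_inputs:
    bi_targets = sorted(y for x, y in bi_pairs if x == z)
    print(f"z={z}: bidirectional targets {bi_targets}, "
          f"student predicts {student_from_bi[z]}; "
          f"causal student predicts {student_from_ar[z]}")
```

In this toy, the student fit to the first teacher predicts the average of the two possible targets for each input, a value that teacher never produces for that input. The student fit to the second teacher reproduces the teacher's map exactly.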

What carries the argument

Frame-level injectivity under an AR teacher's probability flow ODE, which ensures ODE initialization recovers the exact flow map instead of a conditional expectation.
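
The role of injectivity can be stated with a standard fact about squared-error regression. The notation below is introduced only for illustration and is not the paper's: X is what the student sees for one frame, Y is the clean frame the teacher's PF-ODE assigns to that sample, and f is the student. It assumes the ODE-initialization loss is a squared error.

```latex
% Illustrative notation (an assumption, not taken from the paper):
%   X : the student's input for one frame (its noisy frame plus visible context)
%   Y : the clean frame the teacher's PF-ODE assigns to that sample
%   f : the student network
\[
f^\star \;=\; \arg\min_{f}\; \mathbb{E}\,\bigl\| f(X) - Y \bigr\|_2^2
\quad\Longrightarrow\quad
f^\star(x) \;=\; \mathbb{E}\,[\,Y \mid X = x\,].
\]
% If each input corresponds to a unique clean frame, Y = \Phi(X), then
\[
\mathbb{E}\,[\,Y \mid X = x\,] \;=\; \Phi(x),
\]
% and the regression target is the flow map \Phi itself. Otherwise f^\star
% averages over every clean frame compatible with x.
```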

If this is right

  • AR students initialized from AR teachers outperform all baselines on dynamic degree, vision reward, and instruction following.
  • The method surpasses the prior state-of-the-art Self Forcing by 19.3 percent in dynamic degree, 8.7 percent in vision reward, and 16.7 percent in instruction following.
  • Few-step autoregressive video models achieve higher visual quality without increasing inference latency.
  • The architectural gap between bidirectional and causal attention is closed for diffusion distillation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same injectivity requirement may limit distillation success in other causal sequence domains such as audio or long-horizon planning.
  • Future distillation pipelines could combine an AR teacher initialization with additional consistency losses to push step counts even lower.
  • The result suggests that attention-direction alignment between teacher and student is a general prerequisite for flow-map recovery in diffusion-based generators.

Load-bearing premise

An autoregressive teacher satisfies frame-level injectivity under its probability flow ODE so that initialization recovers the teacher's flow map rather than collapsing to a conditional expectation.

What would settle it

An experiment that measures whether an AR student initialized from a bidirectional teacher produces outputs whose statistics match a conditional expectation of the teacher's trajectories, while the same student initialized from an AR teacher matches the teacher's full flow map on identical prompts.

Original abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: https://thu-ml.github.io/CausalForcing.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Causal Forcing, a distillation technique that converts pretrained bidirectional video diffusion models into few-step autoregressive (AR) models for real-time interactive video generation. It diagnoses the failure of prior ODE-based distillation as arising from a violation of frame-level injectivity in the probability-flow ODE (PF-ODE) when a bidirectional teacher is used, which collapses the solution to a conditional expectation rather than recovering the teacher's flow map. By switching to an AR teacher for the ODE initialization step, the method is claimed to close the architectural gap between full and causal attention. The paper reports consistent outperformance over baselines, including gains of 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following relative to the prior SOTA Self Forcing approach.
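
The difference between full and causal attention mentioned in the summary can be pictured at the frame level. The sketch below is a simplified illustration, not the paper's implementation: `frame_attention_mask` and `visible_frames` are hypothetical helpers, and real video models apply masks over tokens (often in blocks or chunks of frames) rather than over whole frames.

```python
def frame_attention_mask(num_frames, causal):
    """Return mask[i][j] == True when frame i may attend to frame j.

    causal=False gives full (bidirectional) attention; causal=True lets
    each frame attend only to itself and to earlier frames.
    """
    return [[(j <= i) if causal else True for j in range(num_frames)]
            for i in range(num_frames)]


def visible_frames(mask, i):
    """List the frame indices that frame i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


full_mask = frame_attention_mask(4, causal=False)
causal_mask = frame_attention_mask(4, causal=True)

for i in range(4):
    print(f"frame {i}: full sees {visible_frames(full_mask, i)}, "
          f"causal sees {visible_frames(causal_mask, i)}")
```

Under the full mask every frame's output can depend on later frames, which is the dependence a causal student cannot reproduce.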

Significance. If the injectivity-based diagnosis is substantiated and the reported gains prove robust, the work supplies a mechanistically motivated route to high-quality AR video diffusion that could accelerate deployment of interactive generation systems. The public release of code and the project page constitute a clear reproducibility asset.

major comments (2)
  1. Abstract and §3 (theoretical justification): the central claim that bidirectional teachers violate frame-level injectivity under the PF-ODE while AR teachers satisfy it is load-bearing for the architectural-gap argument. The manuscript asserts this property for AR models but supplies neither a derivation showing that causal attention preserves injectivity nor a numerical verification that accumulated noise across frames does not destroy uniqueness. Without such support, the mechanistic explanation remains an untested assumption; if the AR case is also non-injective, the reported improvements could be attributable to training details rather than the proposed initialization.
  2. §4 (empirical evaluation): the abstract states that Causal Forcing outperforms all baselines across all metrics, yet the manuscript does not report variance across multiple random seeds, statistical significance tests, or ablation isolating the ODE-initialization component from other training choices. These controls are necessary to confirm that the 19.3% Dynamic Degree gain is attributable to the claimed injectivity mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to strengthen the theoretical and empirical sections.

point-by-point responses
  1. Referee: Abstract and §3 (theoretical justification): the central claim that bidirectional teachers violate frame-level injectivity under the PF-ODE while AR teachers satisfy it is load-bearing for the architectural-gap argument. The manuscript asserts this property for AR models but supplies neither a derivation showing that causal attention preserves injectivity nor a numerical verification that accumulated noise across frames does not destroy uniqueness. Without such support, the mechanistic explanation remains an untested assumption; if the AR case is also non-injective, the reported improvements could be attributable to training details rather than the proposed initialization.

    Authors: We thank the referee for this observation. Section 3 explains that bidirectional attention permits future-frame information to affect the current frame, violating frame-level injectivity of the PF-ODE and yielding a conditional-expectation solution instead of the teacher flow map. Causal attention restricts dependencies to past and present frames, preserving injectivity. While the initial submission did not contain an explicit derivation or additional numerical check, we can supply a short derivation based on the masking properties of causal attention and will add a targeted numerical verification (e.g., checking uniqueness of recovered clean frames under accumulated noise). We will revise §3 accordingly.

    revision: yes

  2. Referee: §4 (empirical evaluation): the abstract states that Causal Forcing outperforms all baselines across all metrics, yet the manuscript does not report variance across multiple random seeds, statistical significance tests, or ablation isolating the ODE-initialization component from other training choices. These controls are necessary to confirm that the 19.3% Dynamic Degree gain is attributable to the claimed injectivity mechanism.

    Authors: We agree that stronger statistical controls and targeted ablations would improve confidence in the results. In the revision we will report means and standard deviations over multiple random seeds for the main metrics and include appropriate significance tests. We will also add an ablation that fixes all other training choices and varies only the teacher used for ODE initialization (AR versus bidirectional), thereby isolating the contribution of the proposed initialization step.

    revision: yes
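
The controls promised in the second response could take a shape like the sketch below. It is one possible analysis, not the authors' protocol: the helper names are hypothetical, the exact sign-flip permutation test is a choice made here for small seed counts, and the score lists are placeholder values invented purely to make the code runnable. They are not results from the paper.

```python
import itertools
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p_value(method_scores, baseline_scores):
    """Two-sided exact sign-flip permutation test on per-seed differences.

    Assumes both lists are aligned by seed. Under the null hypothesis the
    sign of each paired difference is arbitrary, so every sign pattern is
    enumerated and compared with the observed total difference.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1.0, -1.0), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        hits += flipped >= observed - 1e-12  # tolerance for float rounding
        total += 1
    return hits / total


# PLACEHOLDER per-seed scores, invented for illustration only.
ar_init_scores = [0.56, 0.57, 0.55, 0.58, 0.54]
bidirectional_init_scores = [0.50, 0.52, 0.49, 0.51, 0.50]

mean_ar, std_ar = summarize(ar_init_scores)
mean_bi, std_bi = summarize(bidirectional_init_scores)
p_value = paired_sign_flip_p_value(ar_init_scores, bidirectional_init_scores)

print(f"AR init:            {mean_ar:.3f} +/- {std_ar:.3f}")
print(f"Bidirectional init: {mean_bi:.3f} +/- {std_bi:.3f}")
print(f"paired sign-flip p-value: {p_value:.4f}")
```

With n paired seeds this test cannot return a two-sided p-value below 2 / 2^n, so five seeds bound it at 0.0625. That is one practical reason the number of seeds matters for the ablation described above.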

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's central argument rests on the stated property that AR teachers satisfy frame-level injectivity under the PF-ODE (allowing ODE initialization to recover the flow map) while bidirectional teachers do not. This assumption is used to motivate Causal Forcing and is presented as the mechanistic reason for success over prior distillation methods. However, the claimed performance gains are supported by direct empirical comparisons against baselines including Self Forcing, with specific metric improvements reported. No equation, fitted parameter, or self-citation reduces any 'prediction' or result to an input by construction; the injectivity claim functions as an external modeling assumption rather than a self-referential fit or renamed known result. The overall chain remains independent of the target metrics and does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method depends on the existence of an AR teacher whose PF-ODE is frame-level injective; this is treated as given rather than derived or measured in the abstract.

axioms (1)
  • domain assumption: Frame-level injectivity holds for the PF-ODE of an autoregressive teacher but not for a bidirectional teacher.
    Invoked to explain why bidirectional initialization collapses to conditional expectation while AR initialization recovers the flow map.

pith-pipeline@v0.9.0 · 5752 in / 1208 out tokens · 35533 ms · 2026-05-21T17:23:58.085238+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Q-ARVD: Quantizing Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.

  2. Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.

  3. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...

  4. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  5. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  6. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  7. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  8. DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.

  9. Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    Xiaomi EV World Model integrates WorldRec for sparse-query 3D Gaussian reconstruction and WorldGen for fast causal video generation via bidirectional pretraining and causal fine-tuning to support autonomous driving si...

  10. FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

    cs.CV 2026-05 unverdicted novelty 6.0

    FashionChameleon achieves interactive multi-garment video customization in real time by training a teacher model with in-context learning on single-garment pairs, applying streaming distillation, and using training-fr...

  11. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  12. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  13. Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.

  14. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  15. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  16. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  17. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  18. Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while impr...

  19. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  20. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 20 Pith papers · 32 internal anchors

  1. [1]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233,

    Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y ., and Zhu, J. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233,

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y ., English, Z., V oleti, V ., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023a. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S., and Kreis, K. Align you...

  3. [3]

    SkyReels-V2: Infinite-length Film Generative Model

    Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al. Skyreels- v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074,

  4. [4]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y ., Zhang, Y ., Cun, X., Yang, S., Xing, J., Liu, Y ., Chen, Q., Wang, X., et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512,

  5. [5]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y ., and Hsieh, C.-J. Self-forcing++: To- wards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283,

  6. [6]

    Autoregressive Video Generation without Vector Quantization

    Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y ., Lu, H., Shan, S., Qi, Y ., and Wang, X. Autoregressive video generation without vector quantization.arXiv preprint arXiv:2412.14169,

  7. [7]

    Vidarc: Embodied video diffusion model for closed-loop control.arXiv preprint arXiv:2512.17661,

    Feng, Y ., Xiang, C., Mao, X., Tan, H., Zhang, Z., Huang, S., Zheng, K., Liu, H., Su, H., and Zhu, J. Vidarc: Embodied video diffusion model for closed-loop control.arXiv preprint arXiv:2512.17661,

  8. [8]

    Geng, Z., Pokle, A., Luo, W., Lin, J., and Kolter, J. Z. Consistency models made easy.arXiv preprint arXiv:2406.14548,

  9. [9]

    Mean Flows for One-step Generative Modeling

    Geng, Z., Deng, M., Bai, X., Kolter, J. Z., and He, K. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447,

  10. [10]

    End-to-end training for au- toregressive video diffusion via self-resampling.arXiv preprint arXiv:2512.15702,

    Guo, Y ., Yang, C., He, H., Zhao, Y ., Wei, M., Yang, Z., Huang, W., and Lin, D. End-to-end training for au- toregressive video diffusion via self-resampling.arXiv preprint arXiv:2512.15702,

  11. [11]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y ., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103,

  12. [12]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    He, Y ., Yang, T., Zhang, Y ., Shan, Y ., and Chen, Q. La- tent video diffusion models for high-fidelity long video generation.arXiv preprint arXiv:2211.13221,

  13. [13]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303,

  14. [14]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., and Tang, J. Cogvideo: Large-scale pretraining for text-to-video gener- ation via transformers.arXiv preprint arXiv:2205.15868,

  15. [15]

    Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

    Hong, Y ., Mei, Y ., Ge, C., Xu, Y ., Zhou, Y ., Bi, S., Hold- Geoffroy, Y ., Roberts, M., Fisher, M., Shechtman, E., et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040,

  16. [16]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025a. Huang, Y ., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite len...

  17. [17]

    Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

    Jin, Y ., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y ., Mu, Y ., and Lin, Z. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954,

  18. [18]

    Ki, T., Jang, S., Jo, J., Yoon, J., and Hwang, S. J. Avatar forcing: Real-time interactive head avatar generation for natural conversation.arXiv preprint arXiv:2601.00664,

  19. [19]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V ., Yan, J., Chiu, M.-C., et al. Videopoet: A large language model for zero- shot video generation.arXiv preprint arXiv:2312.14125,

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al. Hunyuan- video: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603,

  21. [21]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Lin, B., Ge, Y ., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y ., Yuan, S., Chen, L., et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131,

  22. [22]

    Flow Matching for Generative Modeling

    Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

  23. [23]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Liu, K., Hu, W., Xu, J., Shan, Y ., and Lu, S. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161,

  24. [24]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  25. [25]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Lu, C. and Song, Y . Simplifying, stabilizing and scal- ing continuous-time consistency models.arXiv preprint arXiv:2410.11081,

  26. [26]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Luo, S., Tan, Y ., Huang, L., Li, J., and Zhao, H. La- tent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023a. Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., and Zhang, Z. Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models.Advances in Neural In...

  27. [27]

    Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096,

    10 Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation Mao, X., Li, Z., Li, C., Xu, X., Ying, K., He, T., Pang, J., Qiao, Y ., and Zhang, K. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096,

  28. [28]

    R., Chen, C., and Wetzstein, G

    Po, R., Chan, E. R., Chen, C., and Wetzstein, G. Bag- ger: Backwards aggregation for mitigating drift in au- toregressive video diffusion models.arXiv preprint arXiv:2512.12080,

  29. [29]

    Movie Gen: A Cast of Media Foundation Models

    Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y ., Chuang, C.-Y ., et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

  30. [30]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512,

  31. [31]

    Motionstream: Real-time video gen- eration with interactive motion controls.arXiv preprint arXiv:2511.01266,

    Shin, J., Li, Z., Zhang, R., Zhu, J.-Y ., Park, J., Shechtman, E., and Huang, X. Motionstream: Real-time video gen- eration with interactive motion controls.arXiv preprint arXiv:2511.01266,

  32. [32]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al. Make-a- video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

  33. [33]

    History-Guided Video Diffusion

    Song, K., Chen, B., Simchowitz, M., Du, Y ., Tedrake, R., and Sitzmann, V . History-guided video diffusion.arXiv preprint arXiv:2502.06764,

  34. [34]

    Improved Techniques for Training Consistency Models

    Song, Y . and Dhariwal, P. Improved techniques for training consistency models.arXiv preprint arXiv:2310.14189,

  35. [35]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y ., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er- mon, S., and Poole, B. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456,

  36. [36]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Sun, W., Zhang, H., Wang, H., Wu, J., Wang, Z., Wang, Z., Wang, Y ., Zhang, J., Wang, T., and Guo, C. World- play: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025a. Sun, Z., Peng, Z., Ma, Y ., Chen, Y ., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y ., Zhou, Y ., Lu, Q., et al. Streama- vata...

  37. [37]

    MAGI-1: Autoregressive Video Generation at Scale

    Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al. Magi-1: Au- toregressive video generation at scale.arXiv preprint arXiv:2505.13211,

  38. [38]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

  39. [39]

    Scal- ing autoregressive video models.arXiv preprint arXiv:1906.02634,

    Weissenborn, D., T ¨ackstr¨om, O., and Uszkoreit, J. Scal- ing autoregressive video models.arXiv preprint arXiv:1906.02634,

  40. [40]

    Godiva: Generating open- domain videos from natural descriptions.arXiv preprint arXiv:2104.14806,

    Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., Sapiro, G., and Duan, N. Godiva: Generating open- domain videos from natural descriptions.arXiv preprint arXiv:2104.14806,

  41. [41]

    Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784,

    Wu, X., Zhang, G., Xu, Z., Zhou, Y ., Lu, Q., and He, X. Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784,

  42. [42]

    Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

    Xi, H., Yang, S., Zhao, Y ., Xu, C., Li, M., Li, X., Lin, Y ., Cai, H., Zhang, J., Li, D., et al. Sparse videogen: Acceler- ating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776,

  43. [43]

    Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-Time Infinite Interactive Portrait Animation

    Xiao, S., Zhang, X., Meng, D., Wang, Q., Zhang, P., and Zhang, B. Knot forcing: Taming autoregressive video diffusion models for real-time infinite interactive portrait animation. arXiv preprint arXiv:2512.21734,

  44. [44]

    VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    Xu, J., Huang, Y., Cheng, J., Yang, Y., Xu, J., Wang, Y., Duan, W., Yang, S., Jin, Q., Li, S., et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059,

  45. [45]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157,

  46. [46]

    LongLive: Real-time Interactive Long Video Generation

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025a.

    Yang, Y., Huang, H., Peng, X., Hu, X., Luo, D., Zhang, J., Wang, C., and Wu, Y. Towards one-step causal video generation via adversarial self-distillation...

  47. [47]

    Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

    Yi, J., Jang, W., Cho, P. H., Nam, J., Yoon, H., and Kim, S. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081,

  48. [48]

    RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

    Zhao, M., He, G., Chen, Y., Zhu, H., Li, C., and Zhu, J. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025a.

    Zhao, M., Wang, R., Bao, F., Li, C., and Zhu, J. Controlvideo: conditional control for one-shot text-driven video editing and beyond. Science China Information Sciences, 68(3):1321...

  49. [49]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., and You, Y. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404,

  50. [50]

    A. Extended Related Work

    Video Generative Models. Building on the tremendous success of diffusion models, many works have applied them to video generation (He et al., 2022; Ho et al., 2022; Singer et al., 2022; Blattmann et al., 2023a...

  51. [51]

    and Wan2.1 (Wan et al., 2025). Apart from the full-sequence diffusion models, some works adopt autoregressive next-token prediction to enable video generation (Wu et al., 2021; Hong et al., 2022; Wu et al., 2022; Weissenborn et al., 2019; Yan et al., 2021; Zhao et al., 2025c;a), such as NOVA (Deng et al.,

  52. [52]

    and VideoPoet (Kondratyuk et al., 2023). Video generation based on full-sequence diffusion models currently achieves better overall quality than autoregressive next-token prediction. However, full-sequence diffusion models must generate all frames in one shot, which incurs substantial latency and prevents displaying frames to users as they are produced, h...

  53. [53]

    and Self Forcing (Huang et al., 2025a) introduce distillation strategies to obtain few-step generation models. Such real-time, interactive video generation models are highly promising and have broad applications across many domains. One prominent application is video world modeling. HY-WorldPlay (Sun et al., 2025a), RELIC (Hong et al., 2025), Hunyuan-Game...

  54. [54]

    train real-time interactive video models for realistic world simulation, allowing users to freely explore and take actions in the simulated environment. This interactive world-modeling paradigm further enables embodied intelligence, such as closed-loop control in Vidarc (Feng et al., 2025). Another major application lies in entertainment and media, suppor...

  55. [55]

    imply the following: for the above $z_1, z_2$, in a neighborhood of $z_2$ there exist uncountably many $z_k$, each of which maps to a distinct $\phi(x_t, t)^u$, just as $z_2$ does. Equivalently, $P\big(\mathrm{Var}(\phi(x_t, t)^u \mid x_t^u, t) > 0\big) > 0$. We next prove Proposition 3.3. First, we formalize this in the following statement. Proposition B.2 (Distribution mismatch in chunk-wise regression)....
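As a toy illustration of why positive conditional variance matters for regression (a minimal sketch, not the paper's construction): on a finite dataset, the predictor that minimizes mean squared error outputs, for each input, the mean of the targets paired with that input. The helper `fit_mse_lookup` and both datasets are hypothetical, chosen only to contrast a non-injective pairing with an injective one.

```python
from collections import defaultdict


def fit_mse_lookup(pairs):
    """Return the MSE-optimal lookup predictor on a finite dataset.

    For each distinct input x, the squared error summed over its targets
    is minimized by the mean of those targets.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in pairs:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


# Non-injective pairing (hypothetical data): input 0 has two distinct targets,
# so the target has positive variance given the input.
non_injective = [(0, -1.0), (0, 1.0), (1, 2.0)]

# Injective pairing (hypothetical data): each input has exactly one target.
injective = [(0, -1.0), (1, 2.0)]

pred_non_injective = fit_mse_lookup(non_injective)
pred_injective = fit_mse_lookup(injective)

print(pred_non_injective)
print(pred_injective)
```

In the non-injective case the prediction for input 0 is the average of its two targets and equals neither of them, whereas in the injective case the fitted predictor reproduces every target exactly.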

  56. [56]

    C. More Discussion of Our Method

    C.1. Further Remarks on Autoregressive Diffusion Training Strategies

    In this section, we first provide further remarks on diffusion forcing, and then report results for other training strategies, including PFVG (Wu et al., 2025), BAgger (Po et al., 2025), and Resampling Forcing (Guo et al., 2025). As stated in ...

  57. [57]

    and recent works (e.g., LiveAvatar (Huang et al., 2025b)). Apart from diffusion forcing and teacher forcing, we also experiment with several recent alternatives, including PFVG (Wu et al., 2025), BAgger (Po et al., 2025), and Resampling Forcing (Guo et al., 2025). However, as shown in Tab. 3, these methods provide no significant improvement over teacher f...

  58. [58]

    However, since we use flow matching, i.e., a v-prediction parameterization for the diffusion

    Figure 10. Comparison between asymmetric CD and causal CD (rows: asymmetric CD, causal CD; column: generated video). Asymmetric CD appears highly blurry...