pith. sign in

arxiv: 2606.03971 · v1 · pith:VDSO5MNInew · submitted 2026-06-02 · 💻 cs.CV

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

Pith reviewed 2026-06-28 10:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive video diffusioncausal video generationforesight supervisionrepresentation planning gaplong video consistencytraining-only methodstreaming video models
0
0 comments X

The pith

Causal video generators close their planning gap when future frames supervise representations during training only.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive video diffusion models create states that explain the current segment well but often discard identity, layout, and motion details required for consistent future segments. This representation-level planning gap arises because standard training never asks causal states to account for what comes next. Video-Mirai addresses it by rolling out the generator causally, reading the full completed sequence with a frozen non-causal foresight encoder, and distilling the resulting targets into the causal states via a lightweight predictor. Future frames therefore shape the representations while never entering the generator inputs. At inference the added components are removed, so architecture, per-step cost, and KV-cache behavior remain identical. A reader would care because the method shows how to improve long-horizon coherence in streaming video models without paying any runtime price.

Core claim

Video-Mirai is a training-only procedure in which the generator emits a causal rollout, a frozen foresight encoder reads the completed sequence non-causally to produce targets, and a predictor distills those stopped-gradient targets into the causal states so that future frames directly supervise representation quality.

What carries the argument

A non-causal foresight encoder that supplies future-conditioned targets from completed rollouts, distilled by a lightweight predictor into the generator's causal states.

If this is right

  • Total score on 5-second VBench rises from 83.8 to 84.6.
  • Subject consistency on 30-second rollouts rises from 84.9 to 88.5.
  • Background consistency on 30-second rollouts rises from 90.2 to 91.9.
  • Future frames become more linearly decodable from current causal features.
  • Removing future-conditioned targets eliminates the measured gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of causal inference from non-causal supervision could be tested in autoregressive models for other modalities such as audio or 3D.
  • Longer training horizons might amplify the consistency benefit if the foresight encoder can be scaled without increasing inference cost.
  • The approach suggests that many autoregressive generators may improve by adding auxiliary future-prediction objectives that are discarded at test time.

Load-bearing premise

Targets produced by the non-causal foresight encoder on completed rollouts transfer useful information to causal states without introducing training artifacts.

What would settle it

Training the same generator without the foresight-derived targets and measuring whether 30-second subject and background consistency scores remain unchanged.

Figures

Figures reproduced from arXiv: 2606.03971 by Lang Huang, Runyi Li, Toshihiko Yamasaki, Yonghao Yu, Zerun Wang.

Figure 1
Figure 1. Figure 1: Video-Mirai makes the future more decodable. An MLP readout reconstructs future RGB from the frozen causal generator’s current hidden state. Left to right: current frame, baseline readout, Video-Mirai readout (ours), future frame. blue/red: regions matching the current/future. ∗Corresponding author. arXiv:2606.03971v1 [cs.CV] 2 Jun 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative evidence of the planning gap. Frames at t=0s, 2.5s, 5s from the same prompt. The baseline’s local plausibility per segment masks abrupt drift between segments: the bear’s motion resets, and the woman’s identity and background change. Video-Mirai mitigates these breaks. 1 Introduction Streaming video generation has a hidden burden: every emitted segment becomes a promise that future segments mus… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Video-Mirai. We take a three-segment video as an example. The causal DiT denoises X2 from its noisy version conditioned on X1 via KV-cache. A frozen foresight encoder processes the causal DiT’s clean rollout {X1, X2, X3}, including the future segment X3. A predictor maps the causal DiT’s hidden state h2 into the encoder’s space, where the foresight loss aligns it with the encoder’s fused hidden… view at source ↗
Figure 4
Figure 4. Figure 4: Foresight loss comparison. Cosine similarity vs. MSE, with and without the SIGReg. 3 6 9 Prediction Horizon 11 12 13 14 15 16 P S N R ↑ Baseline Video-Mirai Layer 9 Layer 15 Layer 24 Quality Semantic Total 80 81 82 83 84 85 S c o r e ↑ ★ ★ ★ Baseline Cos Cos w/ SIGReg MSE MSE w/ SIGReg [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Video-Mirai internalizes the future distribution. The readout matches the rollout average. 4.2 Component-wise analysis Injection depth. Foresight injection requires choosing both a causal generator layer L and an encoder layer L ′ . Since the causal generator (30 blocks) and the encoder (40 blocks) differ in depth, we pair layers at matched relative depth: L = αDs, L ′ = αDt [PITH_FULL_IMAGE:figures/full_… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison. Representative frames at t=0s, 2.5s, 5s from the rollouts of the same prompt under the baselines and their Video-Mirai counterparts. Video-Mirai rollouts preserve subject identity and scene composition more reliably across time. Distribution of futures. The probes above show that Video-Mirai’s features support future readout, but the future of a given current is not deterministic. R… view at source ↗
Figure 8
Figure 8. Figure 8: Future-frame readout fidelity across layers and horizons. Solid: Video-Mirai; dashed: Causal-Forcing baseline. Video-Mirai dominates at every layer and horizon. generator’s own rollout, and the two trajectories differ whenever the causal generator is imperfect. Stacked naively, Stage 3 must rewrite the foresight mapping Stage 2 just installed, swapping a clean target for a moving one while simultaneously r… view at source ↗
Figure 9
Figure 9. Figure 9: Video-Mirai’s representations encode the future. An MLP readout reconstructs future RGB from the frozen causal generator’s current hidden state (layer 15). Left to right: current frame, baseline readout, Video-Mirai readout (ours), future frame. blue/red: regions matching the current/future frame. t=0s Causal-Forcing Video-Mirai t=2.5s t=5s Causal-Forcing Video-Mirai [PITH_FULL_IMAGE:figures/full_fig_p019… view at source ↗
Figure 10
Figure 10. Figure 10: More qualitative comparison. Representative frames at t=0s, 2.5s, 5s from the rollouts of the same prompt under the baseline and Video-Mirai counterpart. Video-Mirai rollouts preserve subject identity, scene composition, and motion coherence more reliably across time. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
read the original abstract

Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: https://y0uroy.github.io/Video-Mirai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Video-Mirai, a training-only auxiliary method for autoregressive video diffusion models. The generator produces causal rollouts; a frozen non-causal foresight encoder processes the completed sequence to yield stopped-gradient targets; a lightweight predictor distills those targets into the causal states. Future frames supervise representations only; at inference the encoder and predictor are removed, leaving the original causal architecture, per-step FLOPs, and KV-cache unchanged. Reported gains include a VBench 5-second total score lift from 83.8 to 84.6 and improved subject/background consistency on 30-second rollouts beyond the training horizon.

Significance. If the reported metric improvements prove robust, the approach offers a practical route to close the representation-level planning gap in causal video generators by supplying future-conditioned supervision without altering inference cost or architecture. The training-only nature and explicit ablations on future targets are positive features.

major comments (2)
  1. [Abstract] Abstract and method description provide no information on the foresight encoder's training data, architecture, or exposure to generated video; this leaves the distribution-shift concern (non-causal targets computed on model rollouts versus real-video training) unaddressed and makes it impossible to determine whether the observed 30 s consistency gains (subject 84.9→88.5, background 90.2→91.9) reflect foresight transfer or incidental regularization.
  2. [Abstract] No implementation details, error bars, statistical tests, or full baseline descriptions are supplied for the Causal-Forcing baseline or the reported VBench numbers; the soundness assessment therefore remains low and the central claim that future-conditioned targets are the key ingredient cannot yet be evaluated.
minor comments (1)
  1. [Abstract] The abstract states that probes show future frames become more decodable from current features; the corresponding probe design, layer, and quantitative results should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commitments to revisions where the manuscript can be strengthened without misrepresenting our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract and method description provide no information on the foresight encoder's training data, architecture, or exposure to generated video; this leaves the distribution-shift concern (non-causal targets computed on model rollouts versus real-video training) unaddressed and makes it impossible to determine whether the observed 30 s consistency gains (subject 84.9→88.5, background 90.2→91.9) reflect foresight transfer or incidental regularization.

    Authors: We agree that the abstract and initial method description omitted key details on the foresight encoder. The encoder is a non-causal transformer-based video model with the same architecture as the generator but bidirectional attention; it is pre-trained exclusively on real videos from the training distribution (e.g., WebVid-10M) and remains frozen with no exposure to generated videos at any point. Targets are computed by applying this frozen encoder to the generator's completed causal rollouts (with stopped gradients), which does introduce a distribution shift relative to its real-video training. We will revise the abstract and method section to state these facts explicitly. The paper's ablations already show that ablating the future-conditioned aspect removes the consistency gains, suggesting the benefit is not incidental regularization; we will add a dedicated paragraph discussing the distribution-shift issue and its implications for the 30-second results. revision: yes

  2. Referee: [Abstract] No implementation details, error bars, statistical tests, or full baseline descriptions are supplied for the Causal-Forcing baseline or the reported VBench numbers; the soundness assessment therefore remains low and the central claim that future-conditioned targets are the key ingredient cannot yet be evaluated.

    Authors: We acknowledge that the manuscript lacks sufficient implementation details and statistical rigor for full reproducibility assessment. In revision we will add: a complete description of the Causal-Forcing baseline (standard autoregressive segment-by-segment training without the foresight distillation loss), all relevant hyperparameters, and expanded VBench protocol details including exact evaluation settings. The existing ablations already isolate future-conditioned targets as the key factor by comparing against variants lacking them. However, error bars and statistical tests were not computed in the original experiments. revision: partial

standing simulated objections not resolved
  • Error bars and statistical tests on the reported VBench and consistency metrics, as these were not part of the original experimental protocol and would require new runs.

Circularity Check

0 steps flagged

No circularity: empirical training method with external encoder, no derivations or self-referential reductions

full rationale

The paper presents an empirical training procedure (causal rollout + frozen non-causal encoder + distillation predictor) whose benefits are measured on held-out metrics (VBench scores, long-rollout consistency). No equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems appear in the provided text. The foresight encoder is described as external and frozen; inference discards it entirely. The central claim therefore rests on experimental outcomes rather than any definitional or self-referential reduction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, hyperparameters, or modeling choices; therefore no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5820 in / 1162 out tokens · 28084 ms · 2026-06-28T10:25:22.700774+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 11 canonical work pages · 11 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InNeurIPS, volume 37, pages 58757–58791, 2024

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  3. [3]

    Self-supervised learning from images with a joint- embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InCVPR, pages 15619–15629, 2023

  4. [4]

    LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

    Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics.arXiv preprint arXiv:2511.08544, 2025

  5. [5]

    Revisiting feature prediction for learning visual repre- sentations from video.TMLR, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.TMLR, 2024

  6. [6]

    Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al

    Jake Bruce, Michael D. Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InICML, pages 4603–4623, 2024

  7. [7]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InNeurIPS, volume 37, pages 24081–24125, 2024

  8. [8]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  9. [9]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InICLR, 2025

  10. [10]

    Better & faster large language models via multi-token prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. InICML, pages 15706–15734, 2024

  11. [11]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  12. [12]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 15979–15988, 2022

  13. [13]

    StreamingT2V: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tade- vosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. StreamingT2V: Consistent, dynamic, and extendable long video generation from text. InCVPR, pages 2568–2577, 2025

  14. [14]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  15. [15]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, volume 33, pages 6840–6851, 2020

  16. [16]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InNeurIPS, 2025. 10

  17. [17]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024

  18. [18]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. InICLR, 2025

  19. [19]

    FIFO-Diffusion: Generating infinite videos from text without training

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. FIFO-Diffusion: Generating infinite videos from text without training. InNeurIPS, volume 37, pages 89834–89868, 2024

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  21. [21]

    A path towards autonomous machine intelligence.OpenReview preprint, 2022

    Yann LeCun. A path towards autonomous machine intelligence.OpenReview preprint, 2022. Version 0.9.2

  22. [22]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. InICCV, pages 18262–18272, 2025

  23. [23]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InICLR, 2023

  24. [24]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. InICLR, 2026

  25. [25]

    V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.TMLR, 2024

  26. [26]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023

  27. [27]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  28. [28]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  29. [29]

    FitNets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. InICLR, 2015

  30. [30]

    Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InICLR, 2021

  31. [31]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W.Q. Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  32. [32]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. InICLR, 2025

  33. [33]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  34. [34]

    Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to- video diffusion models. InNeurIPS, volume 37, pages 65618–65642, 2024. 11

  35. [35]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InICLR, 2025

  36. [36]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. InNeurIPS, volume 37, pages 47455–47487, 2024

  37. [37]

    Freeman, and Taesung Park

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pages 6613–6623, 2024

  38. [38]

    Freeman, Frédo Durand, Eli Shechtman, and Xun Huang

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In CVPR, pages 22963–22974, 2025

  39. [39]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InICLR, 2025

  40. [40]

    Mirai: Autoregressive Visual Generation Needs Foresight

    Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, and Toshihiko Yamasaki. Mirai: Autoregres- sive visual generation needs foresight.arXiv preprint arXiv:2601.14671, 2026

  41. [41]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026. 12 Appendix A Implementation Details Model architecture.The causal generator backbone is Wan2.1-T2V-1.3B [ 33]: 30 transformer...