HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

Conglang Zhang; Qian Zhang; Qingjie Wang; Weiqiang Ren; Wei Yin; Xiaoyang Guo; Yifan Zhan; Yinqiang Zheng; Yu Li; Zhanpeng Ouyang

arxiv: 2605.11596 · v2 · pith:326UJHUXnew · submitted 2026-05-12 · 💻 cs.CV

Conglang Zhang, Yifan Zhan, Qingjie Wang, Zhanpeng Ouyang, Yu Li, Zihao Yang, Xiaoyang Guo, Weiqiang Ren, Qian Zhang, Zhen Dong, Yinqiang Zheng, Wei Yin, Zhengqing Chen

Pith reviewed 2026-05-25 06:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords autoregressive · world model · driving simulation · long-horizon · rollout recovery · nuScenes · video distillation

The pith

HorizonDrive makes autoregressive driving world models stable for minute-scale rollouts by training a self-corrective teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to extend driving world models to long-horizon autoregressive generation without drift or high memory use. It does this by first training the teacher model with scheduled rollout recovery to handle corrupted prediction histories and stay aligned with ground truth. This allows the teacher to generate long-horizon supervision signals through its own rollouts, which a student model then matches efficiently. A sympathetic reader would care because current methods are limited to short clips, restricting their use in realistic closed-loop driving simulations that require sustained interaction.

Core claim

HorizonDrive is an anti-drifting training-and-distillation framework for autoregressive driving simulation. Scheduled rollout recovery trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, producing a teacher stable across long autoregressive rollouts. The rollout-capable teacher then extends via autoregressive rollout to provide long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD for real-time deployment.
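
The recovery idea in this claim can be illustrated with a toy sketch. It is not the paper's implementation: the scalar `predict_next` stand-in for the world model, the linear `corruption_prob` schedule, and every name below are assumptions made for illustration. The sketch builds a history in which ground-truth frames are replaced by the model's own predictions with a scheduled probability, then scores the next prediction against the ground-truth future.

```python
import numpy as np

def predict_next(history, weight):
    """Hypothetical stand-in for the world model: a scaled copy of the last frame."""
    return weight * history[-1]

def corruption_prob(step, total_steps, p_max=0.8):
    """Assumed linear schedule: more self-predicted history later in training."""
    return p_max * min(1.0, step / max(1, total_steps))

def build_corrupted_history(gt_frames, weight, p, rng):
    """Replace each ground-truth frame by a model prediction with probability p."""
    history = [gt_frames[0]]
    for t in range(1, len(gt_frames)):
        pred = predict_next(history, weight)
        history.append(pred if rng.random() < p else gt_frames[t])
    return history

def recovery_loss(gt_history, gt_future, weight, p, rng):
    """MSE between the prediction from a corrupted history and the GT future."""
    history = build_corrupted_history(gt_history, weight, p, rng)
    pred = predict_next(history, weight)
    return float(np.mean((pred - gt_future) ** 2))

rng = np.random.default_rng(0)
frames = [np.full(4, float(t)) for t in range(6)]  # toy "video": frame t holds value t
loss_clean = recovery_loss(frames[:5], frames[5], weight=1.0, p=0.0, rng=rng)
loss_corrupt = recovery_loss(frames[:5], frames[5], weight=1.0, p=1.0, rng=rng)
```

With this toy model the fully corrupted history gives a much larger loss than the clean one, which is the gap a recovery objective would train the model to close.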

What carries the argument

Scheduled rollout recovery (SRR), which enables the teacher to remain stable across long autoregressive rollouts by reconstructing from corrupted histories.

If this is right

HorizonDrive natively supports minute-scale AR rollout under bounded memory.
On nuScenes it reduces FID by 52% and FVD by 37% relative to the strongest long-horizon streaming baselines.
Against the same baselines it lowers ARE and DTW by 21% and 9%, respectively.
It remains competitive with single-pass driving video generators.
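
For readers checking the percentages, a relative reduction is usually computed as the drop divided by the baseline value. The sketch below assumes that convention; the page does not state the formula, and the numbers in the example are hypothetical, not values from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline` (lower-is-better metric)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline

# Hypothetical metric values chosen only to illustrate the arithmetic.
example = relative_reduction(baseline=50.0, ours=24.0)  # 26 / 50
```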

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could generalize to other domains like robotics simulation where long-term consistency is needed.
Improved long-horizon models might lead to better training data for reinforcement learning in driving agents.
Testing on real-world driving datasets beyond nuScenes would validate broader applicability.

Load-bearing premise

Scheduled rollout recovery can train a stable teacher without the recovery process itself creating distribution shifts that invalidate the long-horizon supervision.

What would settle it

If after applying SRR the autoregressive rollouts still accumulate errors rapidly beyond short horizons on the nuScenes dataset, the claim of stable long-horizon supervision would be falsified.
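
One simple way to operationalize this test is to compare the mean per-frame error late in a rollout with the mean error early in it. The sketch below is an assumption about how such a check could look; the error measure, the window size, and any threshold for "rapid" accumulation are not specified on this page.

```python
import numpy as np

def drift_ratio(per_frame_error, window):
    """Mean error over the last `window` frames divided by the first `window` frames."""
    errors = np.asarray(per_frame_error, dtype=float)
    if window <= 0 or 2 * window > errors.size:
        raise ValueError("window must be positive and fit twice into the sequence")
    early = errors[:window].mean()
    late = errors[-window:].mean()
    if early == 0:
        raise ValueError("early-window error is zero; ratio is undefined")
    return float(late / early)
```

A ratio near 1 would indicate a flat error profile over the rollout, while a ratio well above 1 would indicate accumulating drift.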

Figures

Figures reproduced from arXiv: 2605.11596 by Conglang Zhang, Qian Zhang, Qingjie Wang, Weiqiang Ren, Wei Yin, Xiaoyang Guo, Yifan Zhan, Yinqiang Zheng, Yu Li, Zhanpeng Ouyang, Zhen Dong, Zhengqing Chen, Zihao Yang.

**Figure 1.** Comparison with general long-video generators and driving world models. General long-video generators can roll out but lack driving-specific control and suffer from drift, while existing driving world models cannot roll out autoregressively. HorizonDrive enables both action-controllable generation and stable long-horizon AR rollout, supporting real-time interactive driving simulation.

**Figure 2.** Overview of the HorizonDrive framework. We first train a conditional driving world model, then improve its autoregressive stability through scheduled rollout recovery, and finally distill long-horizon teacher rollouts into a few-step, short-chunk student via teacher-rollout DMD.

**Figure 3.** Details of scheduled rollout recovery. (a) Boundary-decay sampling gradually shifts training from late, semantically drifted rollout regions to earlier, more generic degradation, while pred-to-GT transition smooths the recovery target. (b) Error heatmaps reveal stronger semantic corruption at later rollout intervals. (c) Cross-case similarity shows that earlier errors are more consistent, supporting the pr…

**Figure 4.** Long-horizon rollout comparison with streaming video generation methods. Our method preserves clearer scene structure, more stable object geometry, and better visual quality over time. Zoom in for details.

**Figure 5.** Long-horizon generation quality comparison.

**Figure 6.** Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 1). From left to right: HorizonDrive, Self-Forcing, Self-Forcing++, LongLive.

**Figure 7.** Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 2). Panel labels: Ours, Self-Forcing, Self-Forcing++, LongLive; time stamps: 0s, 3s, 7s, 11s, 15s, 19s.

**Figure 8.** Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 3). Same layout as …

**Figure 9.** Qualitative comparison with driving world models on nuScenes val (scene 1). From left to right: HorizonDrive, Helios, Matrix-Game3, LingBot-World. Time stamps: 0s, 3s, 7s, 11s, 15s, 19s.

**Figure 10.** Qualitative comparison with driving world models on nuScenes val (scene 2).

**Figure 11.** Qualitative comparison with long-horizon streaming baselines on the self-collected (e2e) dataset (scene 1). From left to right: HorizonDrive, Self-Forcing, Self-Forcing++, LongLive. Time stamps: 0s, 3s, 7s, 11s, 15s, 19s.

**Figure 12.** Qualitative rollouts on the self-collected (e2e) dataset (scene 2).

**Figure 13.** Minute-level autoregressive generation on the self-collected (e2e) dataset.

**Figure 14.** Closed-loop driving simulation. A planner consumes the latest generated frame at each step and outputs an ego trajectory, which HorizonDrive uses as the next-step action condition. Despite the planner-and-world-model loop being driven entirely by self-generated signals, HorizonDrive maintains coherent scene structure and stable agent behavior over long horizons.
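
The loop described in the Figure 14 caption can be sketched as follows. This is a toy illustration under stated assumptions: `planner` and `world_model` are hypothetical scalar stand-ins, and the default `context_len` is an illustrative value. The point it shows is that a fixed-length context keeps the memory seen by the model bounded however long the rollout runs.

```python
from collections import deque

def planner(frame):
    """Hypothetical planner: maps the latest generated frame to an action."""
    return 0.1 * frame

def world_model(context, action):
    """Hypothetical world model: next frame from a bounded context and an action."""
    return sum(context) / len(context) + action

def closed_loop_rollout(first_frame, steps, context_len=11):
    """Run planner and world model in a loop over self-generated frames."""
    context = deque([first_frame], maxlen=context_len)  # oldest frames are dropped
    frames = [first_frame]  # kept only as output; the model never sees more than `context`
    for _ in range(steps):
        action = planner(context[-1])
        next_frame = world_model(context, action)
        context.append(next_frame)
        frames.append(next_frame)
    return frames, len(context)
```

Calling `closed_loop_rollout(1.0, steps=50)` returns 51 frames while the context never holds more than 11 of them.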

Original abstract

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
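
The abstract does not expand its metric acronyms. Assuming DTW refers to dynamic time warping, a standard distance between two sequences that may be locally stretched in time, a minimal sketch of the textbook recurrence for 1-D sequences looks like this. How the paper applies DTW (which signals, which per-step cost) is not stated on this page, so the absolute-difference cost below is an assumption.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because repeated samples can be matched to a single sample, `[0, 1, 2]` and `[0, 1, 1, 2]` are at distance zero under this measure.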

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SRR trains a rollout-stable teacher for long-horizon DMD supervision in driving AR models, delivering clear FID/FVD gains on nuScenes, but the abstract leaves the distribution-shift risk in SRR unaddressed.

The main contribution is making the teacher itself AR-capable through scheduled rollout recovery so its own long rollouts can serve as supervision targets for the student via teacher rollout DMD. This sidesteps the horizon limits of student-degradation methods and the poor transfer of frame sinks in fast ego-motion scenes. The nuScenes results show substantial drops in FID and FVD plus smaller but consistent gains on ARE and DTW against streaming baselines, while matching single-pass generators. That combination is concrete and directly tackles a known bottleneck in closed-loop simulation. The approach is new relative to the cited alternatives and the training procedures do not appear to rest on circular reuse of fitted quantities. The main soft spot is that the abstract gives no evidence the SRR corruption schedule matches the actual error accumulation over minute-scale rollouts, so it is possible the recovered histories introduce their own bias or artifacts that weaken the supervision signal. No error bars, dataset details, or baseline re-implementation notes are provided either. This is for groups working on autoregressive world models for autonomous driving. The problem is real, the fix is explicit, and the numbers are large enough to warrant referee time even if the evidence needs more scrutiny on robustness.

Referee Report

2 major / 2 minor

Summary. The paper introduces HorizonDrive, a self-corrective autoregressive world model framework for long-horizon driving simulation. It proposes scheduled rollout recovery (SRR) to train a rollout-stable teacher from prediction-corrupted histories and teacher rollout DMD (TRD) to provide long-horizon distribution-matching supervision to a short-window student, claiming this enables minute-scale AR rollouts under bounded memory. On nuScenes, it reports 52% FID and 37% FVD reductions plus 21% ARE and 9% DTW improvements over long-horizon streaming baselines while remaining competitive with single-pass generators.

Significance. If the central claims hold, the work would advance closed-loop driving simulation by addressing the supervision-horizon bottleneck in AR distillation methods. The explicit mechanisms for making the teacher rollout-capable without unbounded memory represent a targeted contribution to autoregressive video/world models in dynamic scenes with fast ego-motion.

major comments (2)

[Abstract / §3 (SRR)] The claim that SRR produces a teacher whose AR rollouts supply valid long-horizon supervision rests on the unverified assumption that the scheduled corruption matches the actual error-accumulation statistics of minute-scale rollout; no analysis is provided that the recovered histories remain on-manifold with real driving video or that the resulting targets do not introduce distribution shift.
[Results (nuScenes evaluation)] The reported relative gains (FID −52 %, FVD −37 %, ARE −21 %, DTW −9 %) are stated without error bars, explicit dataset splits, or implementation details of the strongest long-horizon streaming baselines, preventing assessment of whether the improvements are robust or sensitive to post-hoc choices.

minor comments (2)

[Abstract] Acronyms (AR, SRR, TRD, DMD, FID, FVD, ARE, DTW) should be defined at first use in the abstract for readability.
[Method] Notation for the corruption schedule and the DMD objective could be made more explicit to allow direct comparison with prior distillation methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the assumptions in SRR and the experimental reporting. We address each point below and will incorporate clarifications and additional analysis in the revision.

Referee: [Abstract / §3 (SRR)] The claim that SRR produces a teacher whose AR rollouts supply valid long-horizon supervision rests on the unverified assumption that the scheduled corruption matches the actual error-accumulation statistics of minute-scale rollout; no analysis is provided that the recovered histories remain on-manifold with real driving video or that the resulting targets do not introduce distribution shift.

Authors: We agree that a direct verification of error statistics and manifold consistency would strengthen the claim. The SRR schedule is constructed to progressively increase corruption levels to approximate drift accumulation, but the manuscript does not quantify the match to minute-scale rollout errors. In the revision we will add (i) a comparison of per-frame prediction error distributions under SRR versus standard AR rollout on held-out sequences, and (ii) qualitative and quantitative checks (e.g., LPIPS and semantic segmentation consistency) confirming that recovered histories remain on-manifold. These additions will be placed in §3 and the supplementary material.

Revision: yes

Referee: [Results (nuScenes evaluation)] The reported relative gains (FID −52 %, FVD −37 %, ARE −21 %, DTW −9 %) are stated without error bars, explicit dataset splits, or implementation details of the strongest long-horizon streaming baselines, preventing assessment of whether the improvements are robust or sensitive to post-hoc choices.

Authors: We acknowledge the need for greater transparency. The current manuscript reports point estimates without variance or split details. In the revised version we will (i) report means and standard deviations over three random seeds, (ii) explicitly state the nuScenes train/val/test split indices used, and (iii) provide a supplementary table with baseline implementation details, including training hyperparameters and how long-horizon streaming was realized for each comparator. These changes will appear in §4 and the supplementary material.

Revision: yes
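
The promised mean-and-standard-deviation reporting over seeds amounts to a small aggregation step. The sketch below uses the Python standard library and hypothetical scores; no per-seed values appear in the source, and the choice of the sample (n − 1) standard deviation is an assumption.

```python
import statistics

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of one metric across random seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical metric values for three seeds, for illustration only.
mean_score, std_score = summarize_seeds([10.0, 12.0, 14.0])
```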

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new training procedures

The paper's central contribution is the introduction of scheduled rollout recovery (SRR) and teacher rollout DMD (TRD) as novel training and distillation procedures that extend a base model to long-horizon AR stability. These are described as training steps that reconstruct from corrupted histories and align a student to teacher rollouts, without any quoted equations or claims that reduce performance metrics (FID, FVD, ARE, DTW) to previously fitted parameters by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no predictions are statistically forced from input fits. The method is therefore self-contained against external benchmarks on nuScenes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; SRR and TRD likely involve schedule hyperparameters and loss weighting choices that function as free parameters, but none are named or quantified.

pith-pipeline@v0.9.0 · 5863 in / 1304 out tokens · 22824 ms · 2026-05-25T06:29:15.985995+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 18 internal anchors

[1] Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.

[2] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.

[3] Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, et al. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058.

[4] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283.

[5] Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive3d: Controllable 3d generation for any-view rendering in street scenes. arXiv preprint arXiv:2405.14475, 2024a. Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive ...

[6] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.

[7] Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702.

[8] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.
[9] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024a. Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint...

[10] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.

[11] Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212.

[12] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[13] Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, et al. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677, 2025a. Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion i...

[14] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248.

[15] Jianbiao Mei, Tao Hu, Xuemeng Yang, Licheng Wen, Yu Yang, Tiantian Wei, Yukai Ma, Min Dou, Botian Shi, and Yong Liu. Dreamforge: Motion-aware autoregressive video generation for multi-view driving scenes. arXiv preprint arXiv:2409.04003.

[16] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169.
[17] URL https://arxiv.org/abs/2506.09042. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.

[18] Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523.

[19] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614.

[20] Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200.

[21] Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540.

[22] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.

[23] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024a. Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting a...

[24] Tianhe Wu, Ruibin Li, Lei Zhang, and Kede Ma. Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139.
[25] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622.

[26] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.

[27] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024a. Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesu...

[28] Yifan Zhan, Zhengqing Chen, Qingjie Wang, Zhuo He, Muyao Niu, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, and Yinqiang Zheng. Composing driving worlds through disentangled control for adversarial scenario generation. arXiv preprint arXiv:2603.12864.

[29] Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. arXiv preprint arXiv:2506.24113.

[30] Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431.

[31] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. URL https://github.com/hpcaitech/Open-Sora.

[32] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214.
[33] A Implementation details. Backbone and VAE. HorizonDrive is built on Wan 2.1 1.3B [Wan et al., 2025] with full bidirectional attention, and adopts the disentangled driving-control modules described in Sec. 4.1. Since driving scenes involve fast ego-motion and rapidly changing fine details, we fine-tune the original VAE to reduce its temporal compressio...

[34] Datasets and evaluation horizon. For nuScenes [Caesar et al., 2020], we use 700 multi-view videos of ∼20 seconds for training and 150 for validation; the per-clip length of ∼20 s is the dataset upper bound, which determines the horizon of our quantitative evaluation. Longer rollouts of up to one minute on a self-collected dataset are demonstrated qualitati...

[35] Setting · … · SRR (G roll)
Optimizer · AdamW · AdamW
Learning rate · 1e-5 · 1e-5
Weight decay · 1e-2 · 1e-5
Global batch size · 96 · 64
Mixed precision · bf16 · bf16
Training steps · 40K (proprietary) + 10K (nuScenes) · 10K (nuScenes)
GPU usage · 96 NVIDIA 5090 · 64 NVIDIA 5090
Context window length T · 11 · 11
Chunk size K · 10, 40 · 10, 40
Resolution · [256, 512], [384, 768] · [256, 512], [384, 768]
AR rollout...

[36] CFG scale α: 6. C Baseline evaluation protocols. Group (i): long-horizon interactive world model frameworks. These methods are designed for general open-domain or interactive world simulation and do not natively accept our driving control signals (actions, HD maps, bounding boxes). For each method, we use the publicly released checkpoint, condition it only on t...

[1] [1]

Building Normalizing Flows with Stochastic Interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants.arXiv preprint arXiv:2209.15571,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Mixture of contexts for long video generation.arXiv preprint arXiv:2508.21058,

Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, et al. Mixture of contexts for long video generation.arXiv preprint arXiv:2508.21058,

work page arXiv

[4] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283.

[5] Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive3d: Controllable 3d generation for any-view rendering in street scenes. arXiv preprint arXiv:2405.14475, 2024a.
Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive ...

[6] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.

[7] Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702.

[8] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.

[9] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024a.
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint...

[10] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.

[11] Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212.

[12] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[13] Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, et al. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield. arXiv preprint arXiv:2511.22677, 2025a.
Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion i...

[14] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248.

[15] Jianbiao Mei, Tao Hu, Xuemeng Yang, Licheng Wen, Yu Yang, Tiantian Wei, Yukai Ma, Min Dou, Botian Shi, and Yong Liu. Dreamforge: Motion-aware autoregressive video generation for multi-view driving scenes. arXiv preprint arXiv:2409.04003.

[16] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169.

[17] URL https://arxiv.org/abs/2506.09042.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.

[18] Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523.

[19] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614.

[20] Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200.

[21] Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540.

[22] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.

[23] Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024a.
Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting a...

[24] Tianhe Wu, Ruibin Li, Lei Zhang, and Kede Ma. Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139.

[25] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622.

[26] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.

[27] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024a.
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesu...

[28] Yifan Zhan, Zhengqing Chen, Qingjie Wang, Zhuo He, Muyao Niu, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, and Yinqiang Zheng. Composing driving worlds through disentangled control for adversarial scenario generation. arXiv preprint arXiv:2603.12864.

[29] Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. arXiv preprint arXiv:2506.24113.

[30] Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431.

[31] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. URL https://github.com/hpcaitech/Open-Sora.

[32] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214.

A Implementation details

Backbone and VAE. HorizonDrive is built on Wan 2.1 1.3B [Wan et al., 2025] with full bidirectional attention, and adopts the disentangled driving-control modules described in Sec. 4.1. Since driving scenes involve fast ego-motion and rapidly changing fine details, we fine-tune the original VAE to reduce its temporal compressio...

Datasets and evaluation horizon. For nuScenes [Caesar et al., 2020], we use 700 multi-view videos of ∼20 seconds for training and 150 for validation; the per-clip length of ∼20 s is the dataset upper bound, which determines the horizon of our quantitative evaluation. Longer rollouts of up to one minute on a self-collected dataset are demonstrated qualitati...
