pith. machine review for the scientific record.

arxiv: 2605.11596 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive world models · driving simulation · long-horizon rollout · self-corrective training · video generation · nuScenes

The pith

A self-corrective training procedure allows autoregressive driving models to generate minute-scale simulations without drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes HorizonDrive to address drift in autoregressive world models for driving simulation. It first trains a teacher with scheduled rollout recovery so that the teacher can generate reliable long sequences from its own predictions; a student then aligns to the teacher's long-horizon rollouts through teacher-rollout distillation, all while keeping memory use bounded. This matters because it could make realistic closed-loop driving tests feasible over minute-scale horizons rather than short clips.

Core claim

HorizonDrive makes the teacher rollout-capable through scheduled rollout recovery, which trains it to reconstruct ground truth from corrupted histories, so that the teacher can supply long-horizon supervision to the student model without drifting.

What carries the argument

Scheduled rollout recovery (SRR) that trains the model to reconstruct ground-truth clips from prediction-corrupted histories.
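
A minimal sketch of one SRR-style training step, assuming a chunk-wise autoregressive model over latent chunks. Everything here is hypothetical rather than the authors' code: the toy WorldModel, the srr_step helper, and especially the corruption schedule, which only gestures at the paper's boundary-decay sampling and pred-to-GT transition.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    # Toy stand-in: predicts the next latent chunk from a history of chunks.
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, history):                 # history: (B, T, dim)
        out, _ = self.net(history)
        return self.head(out[:, -1])            # next chunk: (B, dim)

def srr_step(model, clip, step, total_steps):
    # One SRR step: replace the tail of the history with the model's own
    # detached predictions, then train to reconstruct the ground-truth
    # next chunk from that corrupted history.
    B, T, D = clip.shape
    history, target = clip[:, :-1].clone(), clip[:, -1]

    # Crude schedule: corrupt a larger tail early in training, a smaller
    # one later (a guess; the paper's boundary-decay sampling is richer).
    frac = 1.0 - step / total_steps
    n_corrupt = max(1, int(frac * (T - 1) / 2))

    with torch.no_grad():                       # self-rollout, no gradients
        for i in range(T - 1 - n_corrupt, T - 1):
            history[:, i] = model(history[:, :i])

    pred = model(history)                       # recover GT from corruption
    return nn.functional.mse_loss(pred, target)

model = WorldModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(10):                          # toy loop on random latents
    loss = srr_step(model, torch.randn(4, 8, 32), step, total_steps=10)
    opt.zero_grad(); loss.backward(); opt.step()
```

The load-bearing move is the rollout under torch.no_grad(): the tail of the history is overwritten by the model's own detached predictions, so the reconstruction loss teaches recovery from exactly the kind of corruption the model will produce at inference time.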

Load-bearing premise

Scheduled rollout recovery produces a teacher model that stays stable and accurate over long autoregressive rollouts without adding biases absent from the ground truth data.

What would settle it

Running minute-long autoregressive simulations in unseen driving scenarios and checking whether visual quality or trajectory accuracy degrades relative to ground truth.
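
A hedged sketch of that settling experiment, reusing the toy WorldModel above: roll the model out purely autoregressively from a short ground-truth context and record per-step error. Plain MSE stands in for the paper's FID/FVD/ARE/DTW metrics, which require dedicated implementations.

```python
import torch

@torch.no_grad()
def rollout_drift(model, gt_clip, context=4):
    # Per-step error of a pure AR rollout vs. ground truth: a flat curve
    # indicates stability, a rising curve indicates drift.
    history = gt_clip[:, :context].clone()
    errors = []
    for t in range(context, gt_clip.shape[1]):
        pred = model(history)                                 # next chunk
        errors.append(torch.mean((pred - gt_clip[:, t]) ** 2).item())
        history = torch.cat([history, pred[:, None]], dim=1)  # feed back
    return errors
```

For example, rollout_drift(model, torch.randn(2, 64, 32)) traces a 60-step curve; on real data, an SRR-trained teacher should show a visibly flatter curve than the base model.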

Figures

Figures reproduced from arXiv: 2605.11596 by Conglang Zhang, Qian Zhang, Qingjie Wang, Weiqiang Ren, Wei Yin, Xiaoyang Guo, Yifan Zhan, Yinqiang Zheng, Yu Li, Zhanpeng Ouyang, Zhen Dong, Zhengqing Chen, Zihao Yang.

Figure 1. Comparison with general long-video generators and driving world models. General long-video generators can roll out but lack driving-specific control and suffer from drift, while existing driving world models cannot roll out autoregressively. HorizonDrive enables both action-controllable generation and stable long-horizon AR rollout, supporting real-time interactive driving simulation.

Figure 2. Overview of the HorizonDrive framework. We first train a conditional driving world model, then improve its autoregressive stability through scheduled rollout recovery, and finally distill long-horizon teacher rollouts into a few-step, short-chunk student via teacher-rollout DMD.

Figure 3. Details of scheduled rollout recovery. (a) Boundary-decay sampling gradually shifts training from late, semantically drifted rollout regions to earlier, more generic degradation, while pred-to-GT transition smooths the recovery target. (b) Error heatmaps reveal stronger semantic corruption at later rollout intervals. (c) Cross-case similarity shows that earlier errors are more consistent.

Figure 4. Long-horizon rollout comparison with streaming video generation methods. Our method preserves clearer scene structure, more stable object geometry, and better visual quality over time.

Figure 5. Long-horizon generation quality comparison.

Figure 6. Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 1). From left to right: HorizonDrive, Self-Forcing, Self-Forcing++, LongLive.

Figure 7. Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 2).

Figure 8. Qualitative comparison with long-horizon streaming baselines on nuScenes val (scene 3).

Figure 9. Qualitative comparison with driving world models on nuScenes val (scene 1). From left to right: HorizonDrive, Helios, Matrix-Game-3, LingBot-World.

Figure 10. Qualitative comparison with driving world models on nuScenes val (scene 2).

Figure 11. Qualitative comparison with long-horizon streaming baselines on the self-collected (e2e) dataset (scene 1). From left to right: HorizonDrive, Self-Forcing, Self-Forcing++, LongLive.

Figure 12. Qualitative rollouts on the self-collected (e2e) dataset (scene 2).

Figure 13. Minute-level autoregressive generation on the self-collected (e2e) dataset.

Figure 14. Closed-loop driving simulation. A planner consumes the latest generated frame at each step and outputs an ego trajectory, which HorizonDrive uses as the next-step action condition. Despite the planner-and-world-model loop being driven entirely by self-generated signals, HorizonDrive maintains coherent scene structure and stable agent behavior over long horizons.
Original abstract

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
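
The abstract's bounded-memory claim has a simple mechanical reading: the teacher streams its own rollout chunk by chunk with gradients detached, so the supervision horizon grows without activation memory growing, while the student only ever attends to a short window. The sketch below, reusing the toy model interface from earlier, substitutes plain regression for the paper's DMD-style distribution matching purely to stay short; every name is hypothetical.

```python
import torch
import torch.nn as nn

def stream_teacher_rollout(teacher, seed, horizon):
    # Yield teacher-predicted chunks one at a time; only a sliding context
    # stays in memory, so the supervision horizon is effectively unbounded.
    history = seed.clone()
    for _ in range(horizon):
        with torch.no_grad():
            nxt = teacher(history)                    # (B, dim) next chunk
        yield nxt
        history = torch.cat([history[:, 1:], nxt[:, None]], dim=1)

def distill(student, teacher, seed, horizon, window=4, lr=1e-4):
    # Short-window student chases the long teacher rollout. MSE stands in
    # for teacher-rollout DMD; the streaming structure is the point here.
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    buf = [seed[:, i] for i in range(seed.shape[1])]
    for target in stream_teacher_rollout(teacher, seed, horizon):
        ctx = torch.stack(buf[-window:], dim=1)       # bounded context
        loss = nn.functional.mse_loss(student(ctx), target)
        opt.zero_grad(); loss.backward(); opt.step()
        buf = (buf + [target])[-window:]              # bounded memory
```

Swapping the regression loss for a score-difference DMD objective changes the loss term but not the bounded-memory streaming structure that the abstract emphasizes.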

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes HorizonDrive, a self-corrective autoregressive world model for long-horizon driving simulation. It introduces scheduled rollout recovery (SRR) to train a stable teacher that reconstructs ground-truth clips from prediction-corrupted histories, followed by teacher rollout DMD (TRD) to distill long-horizon supervision to an efficient student model. The central claims are that this enables minute-scale AR rollouts under bounded memory and yields large gains on nuScenes (52% FID reduction, 37% FVD reduction, 21% ARE and 9% DTW improvement vs. strongest long-horizon streaming baselines) while remaining competitive with single-pass generators.

Significance. If the stability of the SRR-trained teacher under actual long AR rollouts is verified, the framework would meaningfully advance closed-loop driving simulation by removing the horizon limits of prior distillation methods without memory explosion. The reported benchmark gains are substantial and directly address a practical bottleneck in autoregressive video/world models for driving.

major comments (3)
  1. [§3.2] (SRR description): The load-bearing claim that SRR yields a teacher whose own long AR rollouts remain distributionally close to GT (enabling reliable TRD supervision) is not directly tested. No metrics are reported for the teacher's standalone minute-scale AR generations (e.g., FID/FVD/ARE on teacher-only rollouts vs. ground truth), leaving the extrapolation from scheduled short-clip recovery to unbounded AR unverified. (A sketch of what such a teacher-only check could look like follows this list.)
  2. [§4] (Experiments): The nuScenes results compare against long-horizon baselines, but the manuscript omits full details on training schedules, exact train/val splits, whether baselines received equivalent data augmentation, and whether teacher rollouts were used only for training or also in evaluation. This weakens the support for the reported 52%/37% gains being attributable to the proposed method rather than implementation differences.
  3. [§3.3] (TRD): The corruption schedule in SRR is presented as sufficient to prevent drift under fast ego-motion and scene changes, but no ablation varies the schedule parameters or measures actual per-step error accumulation in AR mode to confirm the schedule matches real rollout dynamics on nuScenes.
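
One concrete shape the teacher-only check requested in major comment 1 could take: score decoded frames from pure teacher AR rollouts against ground-truth frames with FID. This is a hedged sketch, not the paper's protocol; it assumes torchmetrics (with its torch-fidelity backend) is installed, elides latent-to-frame decoding, and leaves FVD/ARE/DTW to their own dedicated implementations.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def teacher_rollout_fid(real_frames, rollout_frames):
    # Both inputs: (N, 3, H, W) uint8 frames in [0, 255].
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_frames, real=True)       # ground-truth distribution
    fid.update(rollout_frames, real=False)   # teacher AR rollout frames
    return fid.compute().item()
```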
minor comments (2)
  1. [Figure 2] (framework diagram): The distinction between the SRR training phase and the TRD inference phase could be clarified with explicit arrows or labels indicating which components are frozen during student training.
  2. [§2] (Related Work): The discussion of prior AR distillation methods could include a brief quantitative comparison table of their reported horizon lengths and memory costs to better position the bounded-memory claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and recommendation for major revision. We address each major comment point by point below. Where the manuscript lacks direct verification or details, we will incorporate the requested additions and clarifications in the revised version.

Point-by-point responses
  1. Referee: [§3.2] (SRR description): The load-bearing claim that SRR yields a teacher whose own long AR rollouts remain distributionally close to GT (enabling reliable TRD supervision) is not directly tested. No metrics are reported for the teacher's standalone minute-scale AR generations (e.g., FID/FVD/ARE on teacher-only rollouts vs. ground truth), leaving the extrapolation from scheduled short-clip recovery to unbounded AR unverified.

    Authors: We agree that direct quantitative metrics on the teacher's standalone long-horizon AR rollouts would provide stronger verification of the stability claim. The current manuscript supports the claim through the SRR objective and the resulting student gains, but does not report teacher-only FID/FVD/ARE/DTW on minute-scale generations. We will add these metrics in the revised manuscript, including comparisons against ground truth and relevant baselines. revision: yes

  2. Referee: [§4] (Experiments): The nuScenes results compare against long-horizon baselines, but the manuscript omits full details on training schedules, exact train/val splits, whether baselines received equivalent data augmentation, and whether teacher rollouts were used only for training or also in evaluation. This weakens the support for the reported 52%/37% gains being attributable to the proposed method rather than implementation differences.

    Authors: We acknowledge the need for these details to support reproducibility and attribution. The revised manuscript will include complete training schedules, the exact nuScenes train/validation splits, confirmation that all baselines used identical data augmentations, and explicit clarification that teacher rollouts are employed only for student training via TRD and not for evaluation. revision: yes

  3. Referee: [§3.3] (TRD): The corruption schedule in SRR is presented as sufficient to prevent drift under fast ego-motion and scene changes, but no ablation varies the schedule parameters or measures actual per-step error accumulation in AR mode to confirm the schedule matches real rollout dynamics on nuScenes.

    Authors: The schedule parameters were chosen based on preliminary AR error observations on nuScenes driving scenes. To strengthen this, the revised version will include an ablation varying key schedule parameters (e.g., corruption probability and recovery frequency) along with per-step error accumulation curves in AR mode, demonstrating alignment with actual rollout dynamics. revision: yes
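
A hypothetical skeleton for the promised ablation, reusing the rollout_drift sketch above: retrain one SRR variant per schedule setting and compare how quickly per-step AR error accumulates. The knob name corruption_prob and the sweep values are illustrative, not the paper's.

```python
def schedule_ablation(make_model, train_fn, val_clip, probs=(0.25, 0.5, 0.75)):
    # train_fn(model, corruption_prob) trains one SRR variant end to end;
    # rollout_drift (defined earlier) scores its long AR rollout.
    curves = {}
    for p in probs:
        model = make_model()
        train_fn(model, corruption_prob=p)
        curves[p] = rollout_drift(model, val_clip)  # per-step AR error
    return curves  # similar flat curves across settings = robust schedule
```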

Circularity Check

0 steps flagged

No significant circularity in HorizonDrive derivation

Full rationale

The paper introduces SRR (scheduled rollout recovery) as training the base model to reconstruct ground-truth clips from prediction-corrupted histories and TRD (teacher rollout DMD) as student alignment to the resulting teacher rollouts. These procedures are defined independently of the target long-horizon metrics and evaluated on external nuScenes benchmarks via FID, FVD, ARE, and DTW reductions. No equations, self-citations, or ansatzes reduce the stability claim or the performance gains to fitted inputs by construction; the methods remain falsifiable through the described experiments and do not lean on load-bearing prior results from the same authors or on renamed versions of known patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that SRR training yields stable long-rollout teachers; no free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)
  • SRR schedule
    The timing and corruption schedule for recovery training is a design choice that must be tuned (one possible shape is sketched after this ledger).
axioms (1)
  • domain assumption: A model trained to reconstruct ground truth from prediction-corrupted histories will remain stable under its own autoregressive predictions.
    Invoked as the key insight enabling unbounded-horizon teacher supervision.
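
One possible concrete shape for the SRR-schedule free parameter, loosely following the boundary-decay description in Figure 3 (the corrupted region drifts from late rollout positions toward earlier ones as training proceeds). The decay exponent and spread are guesses, not values from the paper.

```python
import random

def sample_corruption_boundary(step, total_steps, horizon, decay=3.0):
    # Return the history index from which self-predictions replace GT.
    progress = step / total_steps                       # 0 -> 1 over training
    mean = horizon * (1.0 - progress) ** (1.0 / decay)  # late -> early focus
    return max(1, min(horizon - 1, int(random.gauss(mean, 0.1 * horizon))))
```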

pith-pipeline@v0.9.0 · 5632 in / 1111 out tokens · 59481 ms · 2026-05-13T01:12:39.233806+00:00 · methodology

