pith. machine review for the scientific record.

arxiv: 2605.15178 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: no theorem link

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords world model · video generation · diffusion transformer · camera control · long video · hybrid attention · efficient inference

The pith

SANA-WM generates minute-scale 720p videos with camera control at 36 times higher throughput than prior open-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SANA-WM, a 2.6-billion-parameter open-source world model for synthesizing high-fidelity one-minute videos at 720p resolution with accurate camera movements. It rests on four key designs: hybrid linear attention for efficient long-sequence processing, dual-branch camera control, a two-stage generation pipeline, and a robust annotation method that extracts precise metric-scale poses from public videos. Trained on roughly 213,000 video clips, the model completes training in 15 days on 64 H100 GPUs, generates a full one-minute clip on a single GPU, and has a distilled variant that runs on consumer hardware. This setup delivers stronger action-following accuracy than earlier open-source approaches while matching the visual quality of much larger industrial systems.

Core claim

SANA-WM is an efficient 2.6B-parameter world model that uses hybrid linear attention, dual-branch camera control, two-stage refinement, and public-video pose annotation to generate high-quality minute-scale 720p videos with precise 6-DoF control, achieving comparable visual quality to industrial baselines at 36× higher throughput.

What carries the argument

Hybrid Linear Attention combining frame-wise Gated DeltaNet with softmax attention to enable memory-efficient modeling of long video contexts.
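The review does not reproduce the layer's equations, so the following is a minimal PyTorch-style sketch of the idea only: a simplified delta-rule recurrence carried across frames stands in for Gated DeltaNet, and ordinary softmax attention runs within each frame, with a learned gate mixing the two branches. The real SANA-WM block, head layout, and gating are not specified in the text above.

```python
# Hedged sketch of one hybrid attention block: linear (recurrent) branch across
# frames, softmax branch within frames, gated fusion. Not the paper's exact layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridLinearAttentionBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.beta = nn.Linear(dim, 1)      # write strength for the delta rule
        self.gate = nn.Linear(dim, dim)    # mixes linear and softmax branches
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Linear branch: a matrix-valued state updated frame by frame with an
        # error-driven (delta-rule) write, so cost grows linearly with clip length.
        state = x.new_zeros(b, d, d)
        linear_out = []
        for i in range(f):
            ki = F.normalize(k[:, i], dim=-1)                       # (b, t, d)
            beta = torch.sigmoid(self.beta(x[:, i]))                # (b, t, 1)
            pred = torch.einsum("btd,bde->bte", ki, state)
            delta = beta * (v[:, i] - pred)
            state = state + torch.einsum("btd,bte->bde", ki, delta)
            qi = F.normalize(q[:, i], dim=-1)
            linear_out.append(torch.einsum("btd,bde->bte", qi, state))
        linear_out = torch.stack(linear_out, dim=1)                 # (b, f, t, d)

        # Softmax branch: full attention only inside each frame, so its cost is
        # bounded by tokens_per_frame rather than the whole minute-long clip.
        qs, ks, vs = (z.reshape(b * f, t, d) for z in (q, k, v))
        softmax_out = F.scaled_dot_product_attention(qs, ks, vs).reshape(b, f, t, d)

        g = torch.sigmoid(self.gate(x))
        return self.proj(g * linear_out + (1.0 - g) * softmax_out)
```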

If this is right

  • Scalable generation of minute-long videos becomes feasible on limited hardware resources.
  • Precise camera trajectory control supports applications requiring accurate motion simulation.
  • Training world models requires fewer video clips and less compute time than previous approaches.
  • Distilled models enable real-time or near-real-time inference on consumer GPUs.
  • Action-following accuracy in extended video sequences improves over prior open-source baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such efficiency could make advanced world modeling accessible to smaller research teams without access to large compute clusters.
  • The annotation pipeline for metric-scale poses might improve training for other video-based AI tasks if applied more broadly.
  • Hybrid attention designs may transfer to other domains needing long-context sequence modeling like audio or text.
  • Future extensions could integrate this model into interactive environments for robotics planning or virtual reality.

Load-bearing premise

That extracting accurate metric-scale 6-DoF camera poses from public videos provides sufficiently consistent and high-quality action labels for effective world model training.

What would settle it

Generating a set of one-minute videos with complex camera paths using SANA-WM and measuring whether the action-following accuracy falls below that of prior open-source baselines on the same benchmark.
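A minimal sketch of that settling test, under assumed conventions: action-following accuracy is taken as the mean 6-DoF pose error over a clip (translation distance plus quaternion geodesic angle), and the claim survives only if SANA-WM's mean error stays at or below every prior open-source baseline's on the same trajectories. The pose format, weighting, and decision rule here are illustrative, not the benchmark's protocol.

```python
# Hedged sketch of the settling experiment's metric and decision rule.
import numpy as np

def pose_error_6dof(pred: np.ndarray, target: np.ndarray, rot_weight: float = 1.0) -> float:
    """pred/target: (frames, 7) rows of [tx, ty, tz, qw, qx, qy, qz]; returns mean error."""
    trans = np.linalg.norm(pred[:, :3] - target[:, :3], axis=1)
    dots = np.clip(np.abs((pred[:, 3:] * target[:, 3:]).sum(axis=1)), 0.0, 1.0)
    rot = 2.0 * np.arccos(dots)            # geodesic angle between unit quaternions
    return float((trans + rot_weight * rot).mean())

def claim_holds(errors: dict, model: str = "SANA-WM") -> bool:
    """errors: model name -> list of per-clip pose errors on the shared benchmark."""
    ours = np.mean(errors[model])
    return all(ours <= np.mean(v) for name, v in errors.items() if name != model)
```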

read the original abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WMdemonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SANA-WM, a 2.6B-parameter open-source world model for synthesizing high-fidelity 720p minute-scale videos with precise 6-DoF camera control. It relies on four designs: Hybrid Linear Attention (frame-wise Gated DeltaNet combined with softmax), Dual-Branch Camera Control, a Two-Stage Generation Pipeline with a long-video refiner, and a Robust Annotation Pipeline that extracts metric-scale 6-DoF poses from ~213K public clips. The model is trained in 15 days on 64 H100s and generates each 60s clip on a single GPU (the distilled variant denoises a 60s clip in 34s on an RTX 5090 with NVFP4 quantization); the paper claims visual quality comparable to LingBot-World and HY-WorldPlay, stronger action-following accuracy than open-source baselines, and 36× higher throughput on a one-minute world-model benchmark.

Significance. If the empirical results and pipeline validation hold, the work would be significant for demonstrating that minute-scale, controllable video world models can be trained efficiently with public data and modest resources, offering an open-source alternative to industrial systems. The hybrid linear attention and two-stage refinement could inform scalable long-context video architectures, while the efficiency numbers (training time, inference speed) would highlight practical advances in deployment.

major comments (3)
  1. [§3.4] §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.
  2. [§4] §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.
  3. [§3.2] §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.
minor comments (2)
  1. [Abstract] Abstract contains a typographical error: 'SANA-WMdemonstrates' should be 'SANA-WM demonstrates'.
  2. [§3.1] Notation for the Gated DeltaNet (GDN) component is introduced without a clear equation reference or comparison to prior linear-attention variants.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional quantitative validation, detailed experimental protocols, and mathematical specifications as outlined.

read point-by-point responses
  1. Referee: [§3.4] §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.

    Authors: We agree that quantitative validation is essential to substantiate the annotation pipeline's quality. In the revised manuscript we will add a dedicated evaluation subsection reporting absolute trajectory error (ATE), relative pose error, and scale-drift statistics on a held-out subset of 500 clips. Where possible we will compare against ORB-SLAM3 and COLMAP reconstructions on the same sequences; for clips lacking external rigs we will include consistency metrics across overlapping segments and results from a synthetic validation set generated with known ground-truth trajectories. These additions will directly support the action-following claims. revision: yes

  2. Referee: [§4] §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.

    Authors: We acknowledge that the current draft lacks sufficient tabular results and protocol details. The revised Section 4 will include: (1) full quantitative tables with mean and standard deviation (error bars) over three independent runs for all reported metrics; (2) ablation studies isolating each proposed component (hybrid attention, dual-branch control, two-stage refinement); (3) explicit dataset splits, metric definitions (e.g., action-following accuracy as average 6-DoF pose error over 60 s), baseline re-implementation details, and the exact one-minute world-model benchmark protocol. Throughput numbers will be reported with hardware specifications and batch-size settings. revision: yes

  3. Referee: [§3.2] §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.

    Authors: We will expand Section 3.2 with the requested mathematical formulation and pseudocode. The revised text will define the dual-branch architecture via equations: the camera branch produces pose-conditioned embeddings that are injected into the hybrid linear attention layers through cross-attention; the interaction is formalized as a gated fusion where attention weights are modulated by the cumulative 6-DoF trajectory. A consistency regularizer over long sequences will be stated explicitly. Algorithm 1 (pseudocode) will illustrate the forward pass for a 60-second clip, showing how frame-wise Gated DeltaNet and softmax branches receive the camera signal at each step. revision: yes
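The first response promises absolute trajectory error and scale-drift statistics for the annotation pipeline. A minimal sketch of those two metrics follows, using the common SLAM-style formulation (similarity alignment via Umeyama, then RMSE over positions; windowed path-length ratios for drift); the paper's exact definitions are not given in the text above, so treat this formulation as an assumption.

```python
# Hedged sketch of trajectory-validation metrics on (N, 3) camera positions in metres.
import numpy as np

def umeyama_align(est: np.ndarray, gt: np.ndarray):
    """Similarity transform (scale s, rotation R, translation t) mapping est -> gt."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def absolute_trajectory_error(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE of aligned positions (ATE)."""
    s, R, t = umeyama_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(-1).mean()))

def scale_drift(est: np.ndarray, gt: np.ndarray, window: int = 30) -> float:
    """Mean |1 - ratio| of estimated to true path length per window of frames."""
    ratios = []
    for i in range(0, len(est) - window, window):
        d_e = np.linalg.norm(np.diff(est[i:i + window], axis=0), axis=1).sum()
        d_g = np.linalg.norm(np.diff(gt[i:i + window], axis=0), axis=1).sum()
        if d_g > 1e-6:
            ratios.append(d_e / d_g)
    return float(np.mean(np.abs(1.0 - np.array(ratios)))) if ratios else 0.0
```

The third response describes pose-conditioned embeddings injected into the hybrid attention layers through cross-attention, with a gate modulated by the cumulative 6-DoF trajectory. The sketch below illustrates one such injection step in PyTorch; the module names, shapes, and fusion rule are illustrative stand-ins, not SANA-WM's actual formulation.

```python
# Hedged sketch of a dual-branch camera-injection step via gated cross-attention.
import torch
import torch.nn as nn


class CameraInjection(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames * tokens_per_frame, dim) video latents
        # poses:  (batch, frames, 6) per-frame 6-DoF camera increments
        cumulative = poses.cumsum(dim=1)                              # trajectory so far
        cam = self.pose_mlp(torch.cat([poses, cumulative], dim=-1))   # (b, frames, dim)
        attended, _ = self.cross_attn(tokens, cam, cam)               # tokens query camera states
        g = torch.sigmoid(self.gate(torch.cat([tokens, attended], dim=-1)))
        return tokens + g * attended                                  # gated residual injection
```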

Circularity Check

0 steps flagged

No circularity: empirical claims rest on training outcomes, not self-referential derivations

full rationale

The paper describes an empirical architecture (Hybrid Linear Attention, Dual-Branch Camera Control, Two-Stage Pipeline, Robust Annotation Pipeline) trained on ~213K public clips and evaluated on a one-minute benchmark. Reported metrics (action-following accuracy, visual quality, throughput) are presented as measured results of training and inference rather than quantities derived from equations that reduce to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the annotation pipeline is an external data-processing step whose quality is assumed but not mathematically forced into the outputs. The reported chain is therefore checked against external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the validity of the hybrid attention design for long-context video and the accuracy of the pose extraction method from public videos; these are presented as engineering solutions rather than derived from first principles.

free parameters (1)
  • 2.6B parameter count
    Chosen architectural scale to balance performance and efficiency; no fitted value given.
axioms (2)
  • domain assumption Hybrid linear attention can model long video contexts efficiently without significant quality loss compared to full attention
    Central to the architecture's claimed efficiency and long-sequence capability.
  • domain assumption Public videos contain extractable metric-scale 6-DoF poses suitable for high-quality supervision
    Basis for the robust annotation pipeline and training data quality.

pith-pipeline@v0.9.0 · 5619 in / 1525 out tokens · 66608 ms · 2026-05-15T13:52:18.797742+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 40 internal anchors

  1. [1]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  2. [2]

    Genie 3: A new frontier for world models

    Jack Parker-Holder and Shlomi Fruchter. Genie 3: A new frontier for world models. https://deepmind.google/en/blog/genie-3-a-new-frontier-for-world-models/, 2025. Google DeepMind blog post, August 2025

  3. [3]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  4. [4]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  5. [5]

    Aether: Geometric-aware unified world modeling

    Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025

  6. [6]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  7. [7]

    Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

  8. [8]

    Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory

    Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

  9. [9]

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

  10. [10]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

  11. [11]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464, 2024

  12. [12]

    Unified camera positional encoding for controlled video generation.arXiv preprint arXiv:2512.07237, 2025

    Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation.arXiv preprint arXiv:2512.07237, 2025

  13. [13]

    Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

  14. [14]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  15. [15]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546, 2025

  16. [16]

    Introducing Nano Banana Pro

    Google Blog. Introducing Nano Banana Pro. https://blog.google/innovation-and-ai/products/nano-banana-pro/, 2025

  17. [17]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  18. [18]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation- models-as-world-simulators/, 2024. Technical report, February 2024

  19. [19]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  20. [20]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  21. [21]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  22. [22]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  23. [23]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  24. [24]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  25. [25]

    Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

  26. [26]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  27. [27]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  28. [28]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  29. [29]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

  30. [30]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  31. [31]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023

  32. [32]

    Revisiting feature prediction for learning visual representations from video, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024

  33. [33]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  34. [34]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  35. [35]

    Genie 2: A large-scale foundation world model

    Google DeepMind. Genie 2: A large-scale foundation world model. https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/, 2024. Blog post

  36. [36]

    Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

  37. [37]

    Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

    Decart and Etched. Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024. Technical report

  38. [38]

    Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769, 2024

  39. [39]

    Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

  40. [40]

    Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

  41. [41]

    Live: Long-horizon interactive video world modeling.arXiv preprint arXiv:2602.03747, 2026

    Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026

  42. [42]

    Astra: General interactive world model with autoregressive denoising.arXiv preprint arXiv:2512.08931, 2025

    Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising.arXiv preprint arXiv:2512.08931, 2025

  43. [43]

    Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025

    Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025

  44. [44]

    Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

  45. [45]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  46. [46]

    Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

    Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

  47. [47]

    Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871, 2026

    Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, et al. Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871, 2026

  48. [48]

    Ucm: Unifying camera control and memory with time-aware positional encoding warping for world models.arXiv preprint arXiv:2602.22960, 2026

    Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, and Song-Hai Zhang. Ucm: Unifying camera control and memory with time-aware positional encoding warping for world models.arXiv preprint arXiv:2602.22960, 2026

  49. [49]

    Captain safari: A world engine with pose-aligned 3d memory, 2026

    Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine with pose-aligned 3d memory, 2026

  50. [50]

    Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

    Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

  51. [51]

    Drivedreamer: towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

  52. [52]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

  53. [53]

    Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

    Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

  54. [54]

    Memorize when needed: Decoupled memory control for spatially consistent long-horizon video generation, 2026

    Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, and Lei Zhang. Memorize when needed: Decoupled memory control for spatially consistent long-horizon video generation, 2026

  55. [55]

    Vggt-world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

    Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, and Yadan Luo. Vggt-world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

  56. [56]

    Versecrafter: Dynamic realistic video world model with 4d geometric control.arXiv preprint arXiv:2601.05138, 2026

    Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control.arXiv preprint arXiv:2601.05138, 2026

  57. [57]

    HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, et al. Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds.arXiv preprint arXiv:2604.14268, 2026

  58. [58]

    INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

  59. [59]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

  60. [60]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  61. [61]

    Camco: Camera-controllable 3d-consistent image-to-video generation.arXiv preprint arXiv:2406.02509, 2024

    Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024

  62. [62]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

  63. [63]

    Stable virtual camera: Generative view synthesis with diffusion models

    Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025

  64. [64]

    Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  65. [65]

    Cameras as relative positional encoding

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025

  66. [66]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  67. [67]

    Wint3r: Window-based streaming reconstruction with camera token pool

    Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool.arXiv preprint arXiv:2509.05296, 2025

  68. [68]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  69. [69]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020

  70. [70]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020

  71. [71]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

  72. [72]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

  73. [73]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023

  74. [74]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. InInternational Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  75. [75]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  76. [76]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

  77. [77]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024

  78. [78]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  79. [79]

    Qwen3-next: Hybrid attention with gated deltanet

    Qwen Team. Qwen3-next: Hybrid attention with gated deltanet. https://huggingface.co/collections/Qwen/qwen3-next, 2025. Model collections

  80. [80]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

Showing first 80 references.