pith. machine review for the scientific record.

arxiv: 2605.15178 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: no theorem link

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords world model · video generation · diffusion transformer · camera control · long video · hybrid attention · efficient inference

The pith

SANA-WM generates minute-scale 720p videos with camera control at 36 times higher throughput than prior open-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SANA-WM, a 2.6-billion-parameter open-source world model for synthesizing high-fidelity one-minute videos at 720p resolution with accurate camera movements. It rests on four key designs: hybrid linear attention for efficient long-sequence processing, dual-branch camera control, a two-stage generation pipeline, and a robust annotation method that extracts precise metric-scale poses from public videos. Trained on roughly 213,000 video clips, the model completes training in 15 days on 64 H100 GPUs, generates a full one-minute clip on a single GPU, and has a distilled variant that runs on consumer hardware. This setup delivers stronger action-following accuracy than earlier open-source approaches while matching the visual quality of much larger industrial systems.

Core claim

SANA-WM is an efficient 2.6B-parameter world model that uses hybrid linear attention, dual-branch camera control, two-stage refinement, and public-video pose annotation to generate high-quality minute-scale 720p videos with precise 6-DoF control, achieving comparable visual quality to industrial baselines at 36× higher throughput.

What carries the argument

Hybrid Linear Attention combining frame-wise Gated DeltaNet with softmax attention to enable memory-efficient modeling of long video contexts.
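The review does not reproduce the layer's equations, so the following is a minimal PyTorch-style sketch of the idea only: a simplified delta-rule recurrence carried across frames stands in for Gated DeltaNet, and ordinary softmax attention runs within each frame, with a learned gate mixing the two branches. The real SANA-WM block, head layout, and gating are not specified in the text above.

```python
# Hedged sketch of one hybrid attention block: linear (recurrent) branch across
# frames, softmax branch within frames, gated fusion. Not the paper's exact layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridLinearAttentionBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.beta = nn.Linear(dim, 1)      # write strength for the delta rule
        self.gate = nn.Linear(dim, dim)    # mixes linear and softmax branches
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Linear branch: a matrix-valued state updated frame by frame with an
        # error-driven (delta-rule) write, so cost grows linearly with clip length.
        state = x.new_zeros(b, d, d)
        linear_out = []
        for i in range(f):
            ki = F.normalize(k[:, i], dim=-1)                       # (b, t, d)
            beta = torch.sigmoid(self.beta(x[:, i]))                # (b, t, 1)
            pred = torch.einsum("btd,bde->bte", ki, state)
            delta = beta * (v[:, i] - pred)
            state = state + torch.einsum("btd,bte->bde", ki, delta)
            qi = F.normalize(q[:, i], dim=-1)
            linear_out.append(torch.einsum("btd,bde->bte", qi, state))
        linear_out = torch.stack(linear_out, dim=1)                 # (b, f, t, d)

        # Softmax branch: full attention only inside each frame, so its cost is
        # bounded by tokens_per_frame rather than the whole minute-long clip.
        qs, ks, vs = (z.reshape(b * f, t, d) for z in (q, k, v))
        softmax_out = F.scaled_dot_product_attention(qs, ks, vs).reshape(b, f, t, d)

        g = torch.sigmoid(self.gate(x))
        return self.proj(g * linear_out + (1.0 - g) * softmax_out)
```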

If this is right

  • Scalable generation of minute-long videos becomes feasible on limited hardware resources.
  • Precise camera trajectory control supports applications requiring accurate motion simulation.
  • Training world models requires fewer video clips and less compute time than previous approaches.
  • Distilled models enable real-time or near-real-time inference on consumer GPUs.
  • Action-following accuracy in extended video sequences improves over prior open-source baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such efficiency could make advanced world modeling accessible to smaller research teams without access to large compute clusters.
  • The annotation pipeline for metric-scale poses might improve training for other video-based AI tasks if applied more broadly.
  • Hybrid attention designs may transfer to other domains needing long-context sequence modeling like audio or text.
  • Future extensions could integrate this model into interactive environments for robotics planning or virtual reality.

Load-bearing premise

That extracting accurate metric-scale 6-DoF camera poses from public videos provides sufficiently consistent and high-quality action labels for effective world model training.

What would settle it

Generating a set of one-minute videos with complex camera paths using SANA-WM and measuring whether the action-following accuracy falls below that of prior open-source baselines on the same benchmark.
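A minimal sketch of that settling test, under assumed conventions: action-following accuracy is taken as the mean 6-DoF pose error over a clip (translation distance plus quaternion geodesic angle), and the claim survives only if SANA-WM's mean error stays at or below every prior open-source baseline's on the same trajectories. The pose format, weighting, and decision rule here are illustrative, not the benchmark's protocol.

```python
# Hedged sketch of the settling experiment's metric and decision rule.
import numpy as np

def pose_error_6dof(pred: np.ndarray, target: np.ndarray, rot_weight: float = 1.0) -> float:
    """pred/target: (frames, 7) rows of [tx, ty, tz, qw, qx, qy, qz]; returns mean error."""
    trans = np.linalg.norm(pred[:, :3] - target[:, :3], axis=1)
    dots = np.clip(np.abs((pred[:, 3:] * target[:, 3:]).sum(axis=1)), 0.0, 1.0)
    rot = 2.0 * np.arccos(dots)            # geodesic angle between unit quaternions
    return float((trans + rot_weight * rot).mean())

def claim_holds(errors: dict, model: str = "SANA-WM") -> bool:
    """errors: model name -> list of per-clip pose errors on the shared benchmark."""
    ours = np.mean(errors[model])
    return all(ours <= np.mean(v) for name, v in errors.items() if name != model)
```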

read the original abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WMdemonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SANA-WM, a 2.6B-parameter open-source world model for synthesizing high-fidelity 720p minute-scale videos with precise 6-DoF camera control. It relies on four designs: Hybrid Linear Attention (frame-wise Gated DeltaNet combined with softmax), Dual-Branch Camera Control, a Two-Stage Generation Pipeline with a long-video refiner, and a Robust Annotation Pipeline that extracts metric-scale 6-DoF poses from ~213K public clips. The model is trained in 15 days on 64 H100s and generates each 60s clip on a single GPU (the distilled variant denoises a 60s clip in 34s on an RTX 5090 with NVFP4 quantization); the paper claims visual quality comparable to LingBot-World and HY-WorldPlay, stronger action-following accuracy than open-source baselines, and 36× higher throughput on a one-minute world-model benchmark.

Significance. If the empirical results and pipeline validation hold, the work would be significant for demonstrating that minute-scale, controllable video world models can be trained efficiently with public data and modest resources, offering an open-source alternative to industrial systems. The hybrid linear attention and two-stage refinement could inform scalable long-context video architectures, while the efficiency numbers (training time, inference speed) would highlight practical advances in deployment.

major comments (3)
  1. [§3.4] §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.
  2. [§4] §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.
  3. [§3.2] §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.
minor comments (2)
  1. [Abstract] Abstract contains a typographical error: 'SANA-WMdemonstrates' should be 'SANA-WM demonstrates'.
  2. [§3.1] Notation for the Gated DeltaNet (GDN) component is introduced without a clear equation reference or comparison to prior linear-attention variants.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional quantitative validation, detailed experimental protocols, and mathematical specifications as outlined.

read point-by-point responses
  1. Referee: [§3.4] §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.

    Authors: We agree that quantitative validation is essential to substantiate the annotation pipeline's quality. In the revised manuscript we will add a dedicated evaluation subsection reporting absolute trajectory error (ATE), relative pose error, and scale-drift statistics on a held-out subset of 500 clips. Where possible we will compare against ORB-SLAM3 and COLMAP reconstructions on the same sequences; for clips lacking external rigs we will include consistency metrics across overlapping segments and results from a synthetic validation set generated with known ground-truth trajectories. These additions will directly support the action-following claims. revision: yes

  2. Referee: [§4] §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.

    Authors: We acknowledge that the current draft lacks sufficient tabular results and protocol details. The revised Section 4 will include: (1) full quantitative tables with mean and standard deviation (error bars) over three independent runs for all reported metrics; (2) ablation studies isolating each proposed component (hybrid attention, dual-branch control, two-stage refinement); (3) explicit dataset splits, metric definitions (e.g., action-following accuracy as average 6-DoF pose error over 60 s), baseline re-implementation details, and the exact one-minute world-model benchmark protocol. Throughput numbers will be reported with hardware specifications and batch-size settings. revision: yes

  3. Referee: [§3.2] §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.

    Authors: We will expand Section 3.2 with the requested mathematical formulation and pseudocode. The revised text will define the dual-branch architecture via equations: the camera branch produces pose-conditioned embeddings that are injected into the hybrid linear attention layers through cross-attention; the interaction is formalized as a gated fusion where attention weights are modulated by the cumulative 6-DoF trajectory. A consistency regularizer over long sequences will be stated explicitly. Algorithm 1 (pseudocode) will illustrate the forward pass for a 60-second clip, showing how frame-wise Gated DeltaNet and softmax branches receive the camera signal at each step. revision: yes
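The first response promises absolute trajectory error and scale-drift statistics for the annotation pipeline. A minimal sketch of those two metrics follows, using the common SLAM-style formulation (similarity alignment via Umeyama, then RMSE over positions; windowed path-length ratios for drift); the paper's exact definitions are not given in the text above, so treat this formulation as an assumption.

```python
# Hedged sketch of trajectory-validation metrics on (N, 3) camera positions in metres.
import numpy as np

def umeyama_align(est: np.ndarray, gt: np.ndarray):
    """Similarity transform (scale s, rotation R, translation t) mapping est -> gt."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def absolute_trajectory_error(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE of aligned positions (ATE)."""
    s, R, t = umeyama_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(-1).mean()))

def scale_drift(est: np.ndarray, gt: np.ndarray, window: int = 30) -> float:
    """Mean |1 - ratio| of estimated to true path length per window of frames."""
    ratios = []
    for i in range(0, len(est) - window, window):
        d_e = np.linalg.norm(np.diff(est[i:i + window], axis=0), axis=1).sum()
        d_g = np.linalg.norm(np.diff(gt[i:i + window], axis=0), axis=1).sum()
        if d_g > 1e-6:
            ratios.append(d_e / d_g)
    return float(np.mean(np.abs(1.0 - np.array(ratios)))) if ratios else 0.0
```

The third response describes pose-conditioned embeddings injected into the hybrid attention layers through cross-attention, with a gate modulated by the cumulative 6-DoF trajectory. The sketch below illustrates one such injection step in PyTorch; the module names, shapes, and fusion rule are illustrative stand-ins, not SANA-WM's actual formulation.

```python
# Hedged sketch of a dual-branch camera-injection step via gated cross-attention.
import torch
import torch.nn as nn


class CameraInjection(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames * tokens_per_frame, dim) video latents
        # poses:  (batch, frames, 6) per-frame 6-DoF camera increments
        cumulative = poses.cumsum(dim=1)                              # trajectory so far
        cam = self.pose_mlp(torch.cat([poses, cumulative], dim=-1))   # (b, frames, dim)
        attended, _ = self.cross_attn(tokens, cam, cam)               # tokens query camera states
        g = torch.sigmoid(self.gate(torch.cat([tokens, attended], dim=-1)))
        return tokens + g * attended                                  # gated residual injection
```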

Circularity Check

0 steps flagged

No circularity: empirical claims rest on training outcomes, not self-referential derivations

full rationale

The paper describes an empirical architecture (Hybrid Linear Attention, Dual-Branch Camera Control, Two-Stage Pipeline, Robust Annotation Pipeline) trained on ~213K public clips and evaluated on a one-minute benchmark. Reported metrics (action-following accuracy, visual quality, throughput) are presented as measured results of training and inference rather than quantities derived from equations that reduce to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the annotation pipeline is an external data-processing step whose quality is assumed but not mathematically forced into the outputs. The reported chain is therefore checked against external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the validity of the hybrid attention design for long-context video and the accuracy of the pose extraction method from public videos; these are presented as engineering solutions rather than derived from first principles.

free parameters (1)
  • 2.6B parameter count
    Chosen architectural scale to balance performance and efficiency; no fitted value given.
axioms (2)
  • domain assumption Hybrid linear attention can model long video contexts efficiently without significant quality loss compared to full attention
    Central to the architecture's claimed efficiency and long-sequence capability.
  • domain assumption Public videos contain extractable metric-scale 6-DoF poses suitable for high-quality supervision
    Basis for the robust annotation pipeline and training data quality.

pith-pipeline@v0.9.0 · 5619 in / 1525 out tokens · 66608 ms · 2026-05-15T13:52:18.797742+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · 40 internal anchors

  1. [1]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  2. [2]

    Genie 3: A new frontier for world models

    Jack Parker-Holder and Shlomi Fruchter. Genie 3: A new frontier for world models. https://deepmind.google/en/blog/genie-3-a-new-frontier-for-world-models/, 2025. Google DeepMind blog post, August 2025

  3. [3]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  4. [4]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  5. [5]

    Aether: Geometric-aware unified world modeling

    Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025

  6. [6]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  7. [7]

    Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

  8. [8]

    Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory

    Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

  9. [9]

    Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

  10. [10]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

  11. [11]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464, 2024

  12. [12]

    Unified camera positional encoding for controlled video generation.arXiv preprint arXiv:2512.07237, 2025

    Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation.arXiv preprint arXiv:2512.07237, 2025

  13. [13]

    Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

  14. [14]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  15. [15]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546, 2025

  16. [16]

    Introducing Nano Banana Pro

    Google Blog. Introducing Nano Banana Pro. https://blog.google/innovation-and-ai/products/nano-banana-pro/, 2025

  17. [17]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  18. [18]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation- models-as-world-simulators/, 2024. Technical report, February 2024

  19. [19]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  20. [20]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  21. [21]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  22. [22]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  23. [23]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  24. [24]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  25. [25]

    Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

  26. [26]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  27. [27]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  28. [28]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  29. [29]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

  30. [30]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  31. [31]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023

  32. [32]

    Revisiting feature prediction for learning visual representations from video, 2024

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024

  33. [33]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  34. [34]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  35. [35]

    Genie 2: A large-scale foundation world model

    Google DeepMind. Genie 2: A large-scale foundation world model. https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/, 2024. Blog post

  36. [36]

    Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

  37. [37]

    Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024

    Decart and Etched. Oasis: A universe in a transformer.https://oasis-model.github.io/, 2024. Technical report

  38. [38]

    Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769, 2024

  39. [39]

    Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

  40. [40]

    Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

  41. [41]

    Live: Long-horizon interactive video world modeling.arXiv preprint arXiv:2602.03747, 2026

    Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026

  42. [42]

    Astra: General interactive world model with autoregressive denoising.arXiv preprint arXiv:2512.08931, 2025

    Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising.arXiv preprint arXiv:2512.08931, 2025

  43. [43]

    Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025

    Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025

  44. [44]

    Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

  45. [45]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  46. [46]

    Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

    Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

  47. [47]

    Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871, 2026

    Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, et al. Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871, 2026

  48. [48]

    Ucm: Unifying camera control and memory with time-aware positional encoding warping for world models.arXiv preprint arXiv:2602.22960, 2026

    Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, and Song-Hai Zhang. Ucm: Unifying camera control and memory with time-aware positional encoding warping for world models.arXiv preprint arXiv:2602.22960, 2026

  49. [49]

    Captain safari: A world engine with pose-aligned 3d memory, 2026

    Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain safari: A world engine with pose-aligned 3d memory, 2026

  50. [50]

    Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

    Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

  51. [51]

    Drivedreamer: towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: towards real-world-driven world models for autonomous driving.arXiv preprint arXiv:2309.09777, 2023

  52. [52]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

  53. [53]

    Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

    Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

  54. [54]

    Memorize when needed: Decoupled memory control for spatially consistent long-horizon video generation, 2026

    Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, and Lei Zhang. Memorize when needed: Decoupled memory control for spatially consistent long-horizon video generation, 2026

  55. [55]

    Vggt-world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

    Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, and Yadan Luo. Vggt-world: Transforming vggt into an autoregressive geometry world model.arXiv preprint arXiv:2603.12655, 2026

  56. [56]

    Versecrafter: Dynamic realistic video world model with 4d geometric control.arXiv preprint arXiv:2601.05138, 2026

    Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control.arXiv preprint arXiv:2601.05138, 2026

  57. [57]

    HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, et al. Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds.arXiv preprint arXiv:2604.14268, 2026

  58. [58]

    INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

  59. [59]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

  60. [60]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  61. [61]

    Camco: Camera-controllable 3d-consistent image-to-video generation.arXiv preprint arXiv:2406.02509, 2024

    Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024

  62. [62]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

  63. [63]

    Stable virtual camera: Generative view synthesis with diffusion models

    Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025

  64. [64]

    Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering.Advances in Neural Information Processing Systems, 34:19313–19325, 2021

  65. [65]

    Cameras as relative positional encoding

    Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding.arXiv preprint arXiv:2507.10496, 2025

  66. [66]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  67. [67]

    Wint3r: Window-based streaming reconstruction with camera token pool

    Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool.arXiv preprint arXiv:2509.05296, 2025

  68. [68]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  69. [69]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020

  70. [70]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020

  71. [71]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

  72. [72]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

  73. [73]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023

  74. [74]

    Hyena hierarchy: Towards larger convolutional language models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. InInternational Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  75. [75]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  76. [76]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

  77. [77]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024

  78. [78]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  79. [79]

    Qwen3-next: Hybrid attention with gated deltanet

    Qwen Team. Qwen3-next: Hybrid attention with gated deltanet. https://huggingface.co/collections/Qwen/qwen3-next, 2025. Model collections

  80. [80]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

Showing first 80 references.