Recognition: no theorem link
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3
The pith
SANA-WM generates minute-scale 720p videos with camera control at 36 times higher throughput than prior open-source models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SANA-WM is an efficient 2.6B-parameter world model that uses hybrid linear attention, dual-branch camera control, two-stage refinement, and public-video pose annotation to generate high-quality minute-scale 720p videos with precise 6-DoF control, achieving comparable visual quality to industrial baselines at 36x higher throughput.
What carries the argument
Hybrid Linear Attention combining frame-wise Gated DeltaNet with softmax attention to enable memory-efficient modeling of long video contexts.
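The review does not reproduce the layer equations, so the following Python sketch is only a rough illustration of the named mechanism: a gated delta-rule fast-weight recurrence (in the spirit of Gated DeltaNet) paired with ordinary softmax attention. The tensor shapes, the gating parameterization, and the simple additive mixing of the two branches are assumptions for illustration, not SANA-WM's actual layer.

```python
# Illustrative sketch only: a gated delta-rule recurrence plus softmax attention.
# Shapes, gates, and the additive mix of branches are assumed, not taken from the paper.
import torch
import torch.nn.functional as F

def gated_delta_recurrence(q, k, v, alpha, beta):
    """q, k, v: (T, d); alpha, beta: (T,) gates in (0, 1).
    Fast-weight state S follows a gated delta rule:
        S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T
        o_t = S_t q_t
    so memory stays constant in sequence length (one d x d state)."""
    T, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(T):
        Sk = S @ k[t]
        S = alpha[t] * (S - beta[t] * torch.outer(Sk, k[t])) + beta[t] * torch.outer(v[t], k[t])
        outputs.append(S @ q[t])
    return torch.stack(outputs)

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention over the full sequence."""
    scale = q.shape[-1] ** -0.5
    return F.softmax(q @ k.T * scale, dim=-1) @ v

# Toy usage: the linear-time recurrence carries long-range context,
# softmax attention adds exact pairwise interactions on top.
T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
alpha, beta = torch.sigmoid(torch.randn(T)), torch.sigmoid(torch.randn(T))
hidden = gated_delta_recurrence(q, k, v, alpha, beta) + softmax_attention(q, k, v)
print(hidden.shape)  # torch.Size([16, 8])
```

The practical point of such hybrids is the memory profile: the recurrent branch carries a fixed-size state regardless of video length, which is what makes minute-scale contexts tractable.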
If this is right
- Scalable generation of minute-long videos becomes feasible on limited hardware resources.
- Precise camera trajectory control supports applications requiring accurate motion simulation.
- Training world models requires fewer video clips and less compute time than previous approaches.
- Distilled models enable real-time or near-real-time inference on consumer GPUs.
- Action-following accuracy over extended video sequences improves relative to prior open-source baselines.
Where Pith is reading between the lines
- Such efficiency could make advanced world modeling accessible to smaller research teams without access to large compute clusters.
- The annotation pipeline for metric-scale poses might improve training for other video-based AI tasks if applied more broadly.
- Hybrid attention designs may transfer to other domains needing long-context sequence modeling like audio or text.
- Future extensions could integrate this model into interactive environments for robotics planning or virtual reality.
Load-bearing premise
That extracting accurate metric-scale 6-DoF camera poses from public videos provides sufficiently consistent and high-quality action labels for effective world model training.
What would settle it
Generating a set of one-minute videos with complex camera paths using SANA-WM and measuring whether the action-following accuracy falls below that of prior open-source baselines on the same benchmark.
Original abstract
We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WMdemonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SANA-WM, a 2.6B-parameter open-source world model for synthesizing high-fidelity 720p minute-scale videos with precise 6-DoF camera control. It relies on four designs: Hybrid Linear Attention (frame-wise Gated DeltaNet combined with softmax), Dual-Branch Camera Control, a Two-Stage Generation Pipeline with a long-video refiner, and a Robust Annotation Pipeline that extracts metric-scale 6-DoF poses from ~213K public clips. The model is trained in 15 days on 64 H100s and generates each 60s clip on a single GPU; a distilled variant with NVFP4 quantization denoises a 60s 720p clip in 34s on a single RTX 5090. The paper claims visual quality comparable to LingBot-World and HY-WorldPlay, stronger action-following accuracy than open-source baselines, and 36× higher throughput on a one-minute world-model benchmark.
Significance. If the empirical results and pipeline validation hold, the work would be significant for demonstrating that minute-scale, controllable video world models can be trained efficiently with public data and modest resources, offering an open-source alternative to industrial systems. The hybrid linear attention and two-stage refinement could inform scalable long-context video architectures, while the efficiency numbers (training time, inference speed) would highlight practical advances in deployment.
major comments (3)
- §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.
- §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.
- §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.
minor comments (2)
- Abstract: contains a typographical error: 'SANA-WMdemonstrates' should be 'SANA-WM demonstrates'.
- §3.1: Notation for the Gated DeltaNet (GDN) component is introduced without a clear equation reference or comparison to prior linear-attention variants.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional quantitative validation, detailed experimental protocols, and mathematical specifications as outlined.
Point-by-point responses
Referee: §3.4 (Robust Annotation Pipeline): The central claim of precise 6-DoF control and superior action-following rests on the pipeline producing high-quality metric-scale labels, yet no quantitative validation is supplied (e.g., absolute trajectory error, scale-drift metrics, or comparison to ground-truth rigs on held-out sequences). This is load-bearing for the benchmark results.
Authors: We agree that quantitative validation is essential to substantiate the annotation pipeline's quality. In the revised manuscript we will add a dedicated evaluation subsection reporting absolute trajectory error (ATE), relative pose error, and scale-drift statistics on a held-out subset of 500 clips. Where possible we will compare against ORB-SLAM3 and COLMAP reconstructions on the same sequences; for clips lacking external rigs we will include consistency metrics across overlapping segments and results from a synthetic validation set generated with known ground-truth trajectories. These additions will directly support the action-following claims. revision: yes
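For concreteness, a minimal sketch of the kind of ATE computation this response refers to — similarity alignment of estimated and reference camera translations, followed by RMSE — is given below. The trajectory format (time-synchronized N×3 translations) and the Umeyama-style alignment are assumptions about how such a validation would typically be run, not code from the paper.

```python
# Minimal sketch of absolute trajectory error (ATE) between estimated and
# ground-truth camera translations. The trajectory format and the similarity
# alignment are illustrative assumptions.
import numpy as np

def align_similarity(est, gt):
    """Umeyama-style similarity alignment (scale s, rotation R, translation t)
    mapping est -> gt; returns the aligned estimate and the recovered scale."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))     # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                    # keep a proper rotation
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return (s * (R @ est.T)).T + t, s

def ate_rmse(est, gt):
    aligned, scale = align_similarity(est, gt)
    err = np.linalg.norm(aligned - gt, axis=1)
    return np.sqrt((err ** 2).mean()), scale         # RMSE in meters; recovered scale

# Toy usage with a synthetic trajectory and a scaled, rotated, noisy estimate.
gt = np.cumsum(np.random.randn(100, 3) * 0.05, axis=0)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
est = 0.8 * gt @ Rz.T + 0.01 * np.random.randn(100, 3)
rmse, s = ate_rmse(est, gt)
print(f"ATE RMSE: {rmse:.3f} m, recovered scale: {s:.2f}")
```

The recovered scale factor also gives a direct check on the metric-scale claim: a pipeline with negligible scale drift should recover a factor near 1 against metric ground truth.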
Referee: §4 (Experiments and Benchmarks): The abstract and results assert 36× throughput, stronger action-following accuracy, and visual quality comparable to LingBot-World/HY-WorldPlay, but no tables, error bars, ablation studies, or detailed evaluation protocols (dataset splits, metrics, baselines) are provided, preventing assessment of the claims.
Authors: We acknowledge that the current draft lacks sufficient tabular results and protocol details. The revised Section 4 will include: (1) full quantitative tables with mean and standard deviation (error bars) over three independent runs for all reported metrics; (2) ablation studies isolating each proposed component (hybrid attention, dual-branch control, two-stage refinement); (3) explicit dataset splits, metric definitions (e.g., action-following accuracy as average 6-DoF pose error over 60 s), baseline re-implementation details, and the exact one-minute world-model benchmark protocol. Throughput numbers will be reported with hardware specifications and batch-size settings. revision: yes
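As one concrete reading of the proposed metric definition ("action-following accuracy as average 6-DoF pose error over 60 s"), the sketch below averages per-frame rotation and translation errors between commanded and reproduced camera trajectories. The pose representation and the split into rotation and translation components are assumptions; the paper's exact metric may differ.

```python
# Minimal sketch of an action-following metric: mean per-frame rotation geodesic
# error plus mean translation error between commanded and generated camera poses.
# The pose format (per-frame rotation matrices and translations) is assumed.
import numpy as np

def pose_error(R_cmd, t_cmd, R_gen, t_gen):
    """R_*: (T, 3, 3) rotation matrices; t_*: (T, 3) translations.
    Returns mean rotation error in degrees and mean translation error."""
    R_rel = np.einsum('tij,tkj->tik', R_gen, R_cmd)               # R_gen @ R_cmd^T per frame
    cos = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1) / 2, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos)).mean()
    trans_err = np.linalg.norm(t_gen - t_cmd, axis=1).mean()
    return rot_err_deg, trans_err

# Toy usage: identity rotations, slightly noisy translations,
# 1440 frames (roughly 60 s at 24 fps).
T = 1440
R_cmd = np.tile(np.eye(3), (T, 1, 1))
R_gen = R_cmd.copy()
t_cmd = np.zeros((T, 3))
t_gen = t_cmd + 0.02 * np.random.randn(T, 3)
print(pose_error(R_cmd, t_cmd, R_gen, t_gen))
```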
Referee: §3.2 (Dual-Branch Camera Control): The mechanism for ensuring 6-DoF trajectory adherence is described at a high level but lacks equations or pseudocode showing how the branches interact with the hybrid attention to enforce consistency over 60-second sequences.
Authors: We will expand Section 3.2 with the requested mathematical formulation and pseudocode. The revised text will define the dual-branch architecture via equations: the camera branch produces pose-conditioned embeddings that are injected into the hybrid linear attention layers through cross-attention; the interaction is formalized as a gated fusion where attention weights are modulated by the cumulative 6-DoF trajectory. A consistency regularizer over long sequences will be stated explicitly. Algorithm 1 (pseudocode) will illustrate the forward pass for a 60-second clip, showing how frame-wise Gated DeltaNet and softmax branches receive the camera signal at each step. revision: yes
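A minimal sketch of the gated fusion described in this response — a camera branch embedding the cumulative 6-DoF trajectory, queried by video tokens through cross-attention and injected through a learned gate — is shown below. Module sizes, the pose encoding, and where the fused signal enters the hybrid attention stack are illustrative assumptions, not the authors' Algorithm 1.

```python
# Illustrative sketch of pose-conditioned gated fusion: video tokens cross-attend
# to camera-trajectory embeddings, and a learned gate controls the injection.
# Dimensions and wiring are assumed for illustration.
import torch
import torch.nn as nn

class CameraGatedFusion(nn.Module):
    def __init__(self, dim: int, pose_dim: int = 6):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video_tokens: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        """video_tokens: (B, N, dim); poses: (B, T, 6) cumulative 6-DoF trajectory."""
        cam = self.pose_mlp(poses)                                 # (B, T, dim) camera embeddings
        attended, _ = self.cross_attn(video_tokens, cam, cam)      # tokens query the trajectory
        g = torch.sigmoid(self.gate(torch.cat([video_tokens, attended], dim=-1)))
        return video_tokens + g * attended                         # gated residual injection

# Toy usage: 64 video tokens conditioned on a 16-frame camera trajectory.
fusion = CameraGatedFusion(dim=128)
out = fusion(torch.randn(2, 64, 128), torch.randn(2, 16, 6))
print(out.shape)  # torch.Size([2, 64, 128])
```

The long-sequence consistency regularizer promised in the response would sit on top of such a module; the sketch omits it.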
Circularity Check
No circularity: empirical claims rest on training outcomes, not self-referential derivations
Full rationale
The paper describes an empirical architecture (Hybrid Linear Attention, Dual-Branch Camera Control, Two-Stage Pipeline, Robust Annotation Pipeline) trained on ~213K public clips and evaluated on a one-minute benchmark. Reported metrics (action-following accuracy, visual quality, throughput) are presented as measured results of training and inference rather than quantities derived from equations that reduce to the inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the annotation pipeline is an external data-processing step whose quality is assumed rather than built into the outputs by construction. The claims therefore rest on external benchmark measurements rather than on a self-referential derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- 2.6B parameter count
axioms (2)
- domain assumption: Hybrid linear attention can model long video contexts efficiently without significant quality loss compared to full attention
- domain assumption: Public videos contain extractable metric-scale 6-DoF poses suitable for high-quality supervision
Reference graph
Works this paper leans on
- [1] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.
- [2] Jack Parker-Holder and Shlomi Fruchter. Genie 3: A new frontier for world models. https://deepmind.google/en/blog/genie-3-a-new-frontier-for-world-models/, 2025. Google DeepMind blog post, August 2025.
- [3] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
- [4] Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026.
- [5] Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025.
- [6] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025.
- [7] Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026.
- [8] Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026.
- [9] Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-Game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026.
- [10] Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026.
- [11] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [12] Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, and Jianfei Cai. Unified camera positional encoding for controlled video generation. arXiv preprint arXiv:2512.07237, 2025.
- [13] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
- [14] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.
- [15] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.
- [16] Google Blog. Introducing Nano Banana Pro. https://blog.google/innovation-and-ai/products/nano-banana-pro/, 2025.
- [17] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [18] OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. Technical report, February 2024.
- [19] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [21] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [22] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- [23] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.
- [24] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
- [25] Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. SANA-Video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025.
- [26] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.
- [27] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.
- [28] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
- [29] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.
- [30] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- [31] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
- [32] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024.
- [33] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
- [34] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
- [35] Google DeepMind. Genie 2: A large-scale foundation world model. https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/, 2024. Blog post.
- [36] Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.
- [37] Decart and Etched. Oasis: A universe in a transformer. https://oasis-model.github.io/, 2024. Technical report.
- [38] Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. GameGen-X: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769, 2024.
- [39] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-Game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025.
- [40] Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025.
- [41] Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, and Li Jiang. Live: Long-horizon interactive video world modeling. arXiv preprint arXiv:2602.03747, 2026.
- [42] Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, and Jiwen Lu. Astra: General interactive world model with autoregressive denoising. arXiv preprint arXiv:2512.08931, 2025.
- [43] Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. MagicWorld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025.
- [44] Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025.
- [45] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.
- [46] Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- [47] Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, et al. Worldcam: Interactive autoregressive 3D gaming worlds with camera pose as a unifying geometric representation. arXiv preprint arXiv:2603.16871, 2026.
- [48] Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, and Song-Hai Zhang. UCM: Unifying camera control and memory with time-aware positional encoding warping for world models. arXiv preprint arXiv:2602.22960, 2026.
- [49] Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, and Junfei Xiao. Captain Safari: A world engine with pose-aligned 3D memory, 2026.
- [50] Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. DeepVerse: 4D autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025.
- [51] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
- [52] Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15522–15533, 2024.
- [53] Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. MosaicMem: Hybrid spatial memory for controllable video world models. arXiv preprint arXiv:2603.17117, 2026.
- [54] Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, and Lei Zhang. Memorize when needed: Decoupled memory control for spatially consistent long-horizon video generation, 2026.
- [55] Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, and Yadan Luo. VGGT-World: Transforming VGGT into an autoregressive geometry world model. arXiv preprint arXiv:2603.12655, 2026.
- [56] Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. VerseCrafter: Dynamic realistic video world model with 4D geometric control. arXiv preprint arXiv:2601.05138, 2026.
- [57] Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, et al. HY-World 2.0: A multi-modal world model for reconstructing, generating, and simulating 3D worlds. arXiv preprint arXiv:2604.14268, 2026.
- [58] InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. InSpatio-World: A real-time 4D world simulator via spatiotemporal autoregressive modeling. arXiv preprint arXiv:2604.07209, 2026.
- [59] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- [60] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
- [61] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. CamCo: Camera-controllable 3D-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.
- [62] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
- [63] Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025.
- [64] Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021.
- [65] Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encoding. arXiv preprint arXiv:2507.10496, 2025.
- [66] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
- [67] Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. WinT3R: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025.
- [68] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [69] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- [70] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
- [71] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- [72] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, 2023.
- [73] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- [74] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023.
- [75] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [76] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
- [77] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.
- [78] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [79] Qwen Team. Qwen3-Next: Hybrid attention with gated DeltaNet. https://huggingface.co/collections/Qwen/qwen3-next, 2025. Model collections.
- [80] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.