Kairos: A Native World Model Stack for Physical AI

Cong Wan; Dacheng Tao; Feng Lv; Kairos Team: Fei Wang; Kangkang Zhu; Pu Li; Qiming Zhang; Ruiqing Yang; Shan You; Shi Fu

arxiv: 2606.16533 · v2 · pith:ELUIGAAZnew · submitted 2026-06-15 · 💻 cs.AI · cs.CV

Kairos Team: Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

Pith reviewed 2026-06-27 03:59 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords: world models, physical AI, embodied AI, temporal attention, hybrid attention, error bounds, pre-training curriculum, deployment co-design

The pith

Kairos introduces a world model stack that learns from mixed embodiment data and maintains states over long horizons with mathematically bounded error via hybrid attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kairos as a complete native stack for world models aimed at physical AI applications. It organizes pre-training around a curriculum that progresses from open-world videos to human behavior and robot interactions. A unified architecture combines world understanding, generation, and prediction through hybrid linear temporal attention that factors local, mid-range, and global dependencies. Formal bounds are stated to show this factorization strictly limits error accumulation while guaranteeing state propagation across extended time horizons. The system also includes deployment co-design for low-latency operation on varied hardware, and experiments report top-level results on embodied, long-horizon, and policy benchmarks alongside favorable efficiency trade-offs.

Core claim

Kairos pioneers a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum that sequences heterogeneous experience into a developmental pathway. It maintains the world through a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention handles local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention sustains global memory. Formal theoretical bounds demonstrate that this temporal factorization strictly limits error accumulation and mathematically guarantees state propagation across extended horizons. Deployment-aware system co-design enables low-latency rollout on server and consumer

What carries the argument

Hybrid Linear Temporal Attention, which combines sliding-window attention for local dynamics, dilated sliding windows for mid-range dependencies, and gated linear attention for persistent global memory. The theoretical bounds on error accumulation rest on this factorization.
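The three components can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the scalar gate, and the mask conventions are all hypothetical, and the abstract does not say how Kairos combines the three outputs.

```python
def sliding_window_mask(T, window):
    """mask[t][s] is True when query step t may attend to key step s.

    Causal local attention: each step sees itself and the previous window - 1 steps.
    """
    return [[0 <= t - s < window for s in range(T)] for t in range(T)]


def dilated_window_mask(T, window, dilation):
    """Causal window with `window` taps spaced `dilation` steps apart."""
    return [
        [0 <= t - s < window * dilation and (t - s) % dilation == 0 for s in range(T)]
        for t in range(T)
    ]


def gated_linear_memory(keys, values, gates):
    """Recurrent memory S_t = g_t * S_{t-1} + outer(k_t, v_t).

    Assumption: one scalar gate in [0, 1] per step. Returns one memory
    matrix per step; the work per step does not grow with sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    states = []
    for k, v, g in zip(keys, values, gates):
        S = [[g * S[i][j] + k[i] * v[j] for j in range(d_v)] for i in range(d_k)]
        states.append(S)
    return states


def read_memory(S, q):
    """Query the memory: returns q^T S as a list of length d_v."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]
```

The two masks cover a fixed number of past steps each, so their cost per query is constant, while the gated memory carries a decayed summary of everything earlier. That division of labor is what the local, mid-range, and global description refers to.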

If this is right

Enables low-latency observation-action-feedback loops on consumer-grade hardware.
Organizes open-world videos, human data, and robot interactions into a single progressive training pathway.
Delivers top-level results on embodied world-model and long-horizon benchmarks while preserving efficiency.
Supplies mathematical guarantees for state propagation that support extended physical AI operation.
Forms an operational foundation for future self-evolving physical intelligence systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The error-bound approach could reduce reliance on frequent model resets or heavy retraining in deployed robotic systems.
Integration with existing reinforcement learning loops might improve sample efficiency in policy learning without extra compute scaling.
Testing the curriculum on additional data modalities could reveal whether the bounds hold when embodiment gaps widen further.

Load-bearing premise

The Hybrid Linear Temporal Attention mechanism with sliding-window, dilated, and gated linear components produces the claimed strict limit on error accumulation and state propagation guarantees.

What would settle it

A controlled long-horizon rollout test on an embodied benchmark that measures whether prediction error exceeds the theoretical bound derived from the temporal factorization.
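One way such a test could be scored is sketched below in Python. The function name `first_bound_violation`, the geometric form of `example_bound`, and its constants are hypothetical stand-ins chosen for illustration. The paper's actual bound is not available in the abstract and would replace `example_bound`.

```python
def first_bound_violation(errors, bound):
    """Return the first rollout step whose measured error exceeds bound(step).

    errors: per-step prediction errors from a long-horizon rollout.
    bound:  callable mapping a step index to the theoretical error limit.
    Returns None when every step stays within the bound.
    """
    for step, err in enumerate(errors):
        if err > bound(step):
            return step
    return None


def example_bound(step, eps=0.1, gamma=0.5):
    """Hypothetical geometric bound eps * (1 - gamma**(step + 1)) / (1 - gamma).

    A placeholder for illustration, not the bound derived in the paper.
    """
    return eps * (1.0 - gamma ** (step + 1)) / (1.0 - gamma)


# Made-up error traces, used only to exercise the check.
within = first_bound_violation([0.05, 0.10, 0.12], example_bound)
violated = first_bound_violation([0.05, 0.10, 0.30], example_bound)
```

A returned step index would count as evidence against the claimed guarantee, provided the rollout satisfies the assumptions under which the bound was derived.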

Original abstract

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kairos bundles a data curriculum, hybrid linear temporal attention, and hardware co-design into one world-model stack, but the asserted formal bounds on error accumulation still need the actual derivations to be checked.

The main point is that this paper puts forward a full native stack for world models aimed at physical AI. It uses a progressive curriculum across open videos, human data, and robot interactions for pre-training, a hybrid attention block that mixes sliding-window, dilated, and gated linear components for long-horizon state keeping, and a co-design layer to keep rollouts fast on both server and edge hardware.

What stands out as useful is the system-level integration. Treating perception, prediction, and action as one trainable pipeline rather than bolted-on modules is a reasonable direction, and the curriculum idea gives a concrete way to stage the data. The attention design tries to avoid quadratic cost while covering different time scales, which matches real deployment needs.
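To picture what staging the data could look like in practice, here is a minimal Python sketch of a three-stage sampling schedule. Everything specific in it is an assumption made for illustration: the stage names, the mixing weights, the equal stage lengths, and the choice to keep replaying earlier sources in later stages. The abstract gives only the ordering (open-world videos, then human behavioral data, then robot interactions).

```python
import random

# Hypothetical stages: (stage name, sampling weights over data sources).
# The weights are placeholders, not values reported by the paper.
STAGES = [
    ("open_world_video", {"open_world_video": 1.0}),
    ("human_behavior", {"open_world_video": 0.3, "human_behavior": 0.7}),
    ("robot_interaction", {"open_world_video": 0.1, "human_behavior": 0.3,
                           "robot_interaction": 0.6}),
]


def stage_for_step(step, steps_per_stage):
    """Map a training step to its curriculum stage; the last stage absorbs the rest."""
    idx = min(step // steps_per_stage, len(STAGES) - 1)
    return STAGES[idx]


def sample_source(step, steps_per_stage, rng):
    """Pick the data source for one batch according to the current stage's weights."""
    _, weights = stage_for_step(step, steps_per_stage)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
first_source = sample_source(5, 100, rng)    # stage 1 has only one source
late_stage = stage_for_step(10_000, 100)[0]  # steps past the schedule stay in stage 3
```

Whether later stages should keep sampling earlier sources, and in what proportion, is the kind of detail the full paper would need to specify.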

The soft spot is the theoretical claim. The abstract states that the temporal factorization strictly limits error accumulation and mathematically guarantees state propagation over long horizons. The stress-test note is right that no equations, assumptions, or proof sketch appear in the provided abstract, so it is impossible to verify whether the bound actually follows from the described components or rests on unstated restrictions like bounded state norms. If the full paper contains the derivation, it should be front and center; if not, the guarantee remains an assertion.
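To make the question concrete, here is the kind of argument such a bound would typically need. This is an illustrative sketch, not the paper's derivation: the recurrence, the symbols $e_t$ (state error at step $t$), $\gamma$, and $\varepsilon$, and both assumptions are introduced here for illustration only.

```latex
% Illustrative only: a generic contraction argument, not the bound from the paper.
% Assumptions (hypothetical): the memory gate shrinks the carried-over state error
% by a factor of at most \gamma < 1 per step, and each step adds at most
% \varepsilon of new error.
\begin{align*}
e_t &\le \gamma\, e_{t-1} + \varepsilon, \qquad 0 \le \gamma < 1, \\
e_t &\le \gamma^{t} e_0 + \varepsilon \sum_{i=0}^{t-1} \gamma^{i}
     = \gamma^{t} e_0 + \varepsilon\,\frac{1-\gamma^{t}}{1-\gamma}
     \le e_0 + \frac{\varepsilon}{1-\gamma}.
\end{align*}
```

Under these assumptions the error stays bounded for every horizon $t$. If the gate can reach $\gamma = 1$, the same recurrence only gives $e_t \le e_0 + t\,\varepsilon$, which grows linearly with the horizon, so the restrictions placed on the gating function are exactly what a reader would need to see.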

Benchmark claims of top performance with a good efficiency trade-off are stated but cannot be weighed without the specific baselines, ablations, and error bars. The work is aimed at people building embodied systems who want an integrated starting point rather than separate papers on each piece. It is coherent enough on its own terms to warrant a serious referee, mainly to get the theory and comparisons properly examined.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Kairos, a native world model stack for Physical AI. It features (1) a Cross-Embodiment Data Curriculum for native pre-training on open-world videos, human data, and robot interactions; (2) a Native Unified Architecture with Hybrid Linear Temporal Attention (sliding-window for local dynamics, dilated windows for mid-range, gated linear for global memory) that claims formal theoretical bounds strictly limiting error accumulation and guaranteeing state propagation over long horizons; and (3) Deployment-Aware System Co-Design for low-latency rollouts. Experiments claim top-level performance with strong efficiency trade-offs on embodied world-model, long-horizon, and action-policy benchmarks.

Significance. If the formal bounds are rigorously derived and the benchmark results hold with proper controls and baselines, the work would be significant for providing an integrated, deployment-ready foundation for physical AI that addresses long-horizon state maintenance and efficiency, moving beyond passive visual world models.

Major comments (2)

[Abstract] The assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and of 'mathematically guaranteeing state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how the combination of sliding-window, dilated, and gated linear attention produces these properties where standard attention would not; this is load-bearing for the central theoretical claim.
[Abstract, Experiments] Claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, which prevents assessing whether superiority is demonstrated or whether the results reduce to unstated fitting.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.

Referee: [Abstract] The assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and of 'mathematically guaranteeing state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how the combination of sliding-window, dilated, and gated linear attention produces these properties where standard attention would not; this is load-bearing for the central theoretical claim.

Authors: We agree the abstract states the claim at a high level without derivation details. The full manuscript contains the rigorous derivation in Section 4, including assumptions (bounded state norms, Lipschitz properties of the gating function), the error-bound proof for the hybrid factorization versus standard attention, and the state-propagation guarantee over long horizons. To address the concern, we will revise the abstract to briefly note the key assumptions and reference Section 4 for the complete analysis and proof sketch.

Revision: partial

Referee: [Abstract, Experiments] Claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, which prevents assessing whether superiority is demonstrated or whether the results reduce to unstated fitting.

Authors: The abstract summarizes high-level outcomes; all requested details (specific metrics, baselines, error bars, dataset sizes, and statistical tests) appear in Section 5 with Tables 1–4 and Figures 3–6. We will revise the abstract to include one or two key quantitative results (e.g., relative gains and efficiency metrics) for improved clarity while preserving length constraints. No changes are required in the experiments section.

Revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical bounds asserted independently of inputs

The paper claims to establish formal theoretical bounds showing that the Hybrid Linear Temporal Attention factorization strictly limits error accumulation and guarantees state propagation. The provided text contains no equations, no derivation steps, no fitted parameters renamed as predictions, and no self-citations used to justify the bounds. Without any exhibited reduction of the claimed result to its own inputs by construction, the derivation chain is self-contained. This is the expected honest non-finding when no load-bearing circular step can be quoted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; full manuscript required for ledger construction.

pith-pipeline@v0.9.1-grok · 5850 in / 1089 out tokens · 68707 ms · 2026-06-27T03:59:15.574766+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

188 extracted references · 58 linked inside Pith

[1]

Learning to model the world: A survey of world models in artificial intelligence.TechRxiv, 2026

Jiahua Dong, Qi Lyu, Baichen Liu, Xudong Wang, Wenqi Liang, Duzhen Zhang, Jiahang Tu, Hongliu Li, Hanbin Zhao, Henghui Ding, Yulun Zhang, Zhi Han, Nicu Sebe, Fahad Shahbaz Khan, Salman Khan, Mubarak Shah, Philip Torr, Ming-Hsuan Yang, and Dacheng Tao. Learning to model the world: A survey of world models in artificial intelligence.TechRxiv, 2026

2026
[2]

NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin...

2025
[3]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

Pith/arXiv arXiv 2025
[4]

V-jepa 2.1: Unlocking dense features in video self-supervised learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

Pith/arXiv arXiv 2026
[5]

Back to the features: Dino as a foundation for video world models, 2025

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the features: Dino as a foundation for video world models, 2025

2025
[6]

Marble: A multimodal world model.https://marble.worldlabs.ai/, 2025

World Labs. Marble: A multimodal world model.https://marble.worldlabs.ai/, 2025

2025
[7]

Teleworld: Towards dynamic multimodal synthesis with a 4d world model, 2025

Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan’er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, and Xuelong Li. Teleworld: T...

2025
[8]

Genie 3: A new frontier for world models

Google DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/blog/ genie-3-a-new-frontier-for-world-models/, 2025

2025
[9]

Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

2025
[10]

Lingbot-world: An interactive world model for embodied intelligence.arXiv preprint, January 2026

Robbyant Team. Lingbot-world: An interactive world model for embodied intelligence.arXiv preprint, January 2026. Ant Group Robbyant Technology

2026
[11]

Training agents inside of scalable world models, 2025

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025

2025
[12]

Worldmodelbench: Judging video generation models as world models

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong 61 Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694, 2025

arXiv 2025
[13]

Dreamgen: Unlocking generalization in robot learning through neural trajectories.arXiv preprint arXiv:2505.12705, 2025

Jim Fan, Yoel Jang, Ireayo Akinola, et al. Dreamgen: Unlocking generalization in robot learning through neural trajectories.arXiv preprint arXiv:2505.12705, 2025. Introduces DreamGen Bench, a video generation benchmark for robot learning

Pith/arXiv arXiv 2025
[14]

Qwen2.5-vl, January 2025

Qwen Team. Qwen2.5-vl, January 2025

2025
[15]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

2026
[16]

Ace-brain-0: Spatial intelligence as a shared scaffold for universal embodiments, 2026

Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, and Xiaogang Wang. Ace-brain-0: Spatial intelligence as a shared scaffold for universal emb...

2026
[18]

Gated delta networks: Improving mamba2 with delta rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In The Thirteenth International Conference on Learning Representations
[19]

AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialun Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yue Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng-Xing Ruan, Jiaqi Shan, Yongjian Shen, Ch...

2025
[20]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

2024
[21]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023
[22]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023. 62

2023
[23]

Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

Pith/arXiv arXiv 2024
[25]

Open-sora 2.0: Training a commercial-level video generation model in $200k.arXiv preprint arXiv:2503.09642, 2025

Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k.arXiv preprint arXiv:2503.09642, 2025

Pith/arXiv arXiv 2025
[26]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Pith/arXiv arXiv 2025
[27]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

2019
[28]

Fusionbench: A comprehensive benchmark of deep model fusion.Journal of MachineLearning Research, 2025

Anke Tang, Li Shen, Yong Luo, Enneng Yang, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep model fusion.Journal of MachineLearning Research, 2025

2025
[29]

Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities.ACM Comput

Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities.ACM Comput. Surv., 58(8), February 2026

2026
[30]

Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022

2022
[31]

Revisiting weight averaging for model merging

Jiho Choi, Donggyun Kim, Chanhyuk Lee, and Seunghoon Hong. Revisiting weight averaging for model merging. arXiv preprint arXiv:2412.12153, 2024

arXiv 2024
[32]

Ties-merging: Resolving interference when merging models, 2023

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models, 2023

2023
[33]

Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

2024
[34]

Whoever started the interference should end it: Guiding data-free model merging via task vectors, 2025

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors, 2025

2025
[35]

Diffusion model alignment using direct preference optimization, 2023

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023

2023
[36]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

2024
[37]

Videodpo: Omni-preference alignment for video diffusion generation

Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025. 63

2025
[38]

Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

2023
[39]

Ring attention with blockwise transformers for near-infinite context, 2023

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023

2023
[40]

Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[41]

Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, and Siyu Zhu. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2025

2025
[42]

Vidgen-1m: A large-scale dataset for text-to-video generation

Zirui Tan, Yandong Li, Yaliang Li, and Jingren Zhou. Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024

arXiv 2024
[43]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision (ECCV), pages 402–419. Springer, 2020

2020
[44]

Fine-tuned vision transformer for nsfw image classification

FalconsAI Team. Fine-tuned vision transformer for nsfw image classification. Hugging Face Model Hub,
[45]

Initial commit 2023-10-14, Last updated 2025-04-06, Apache-2.0 License, 80k training images, 98.04% accuracy, 85.8M params

2023
[46]

Yolox: Exceeding yolo series in 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021

Pith/arXiv arXiv 2021
[47]

Bytetrack: Multi-object tracking by associating every detection box.arXiv preprint arXiv:2110.06864, 2021

YifuZhang, PeizeSun, YiJiang, DongdongYu, ZehuanYuan, PingLuo, WenyuLiu, andXinggangWang. Bytetrack: Multi-object tracking by associating every detection box.arXiv preprint arXiv:2110.06864, 2021

arXiv 2021
[48]

Real-time scene text detection with differentiable binarization.arXiv preprint arXiv:1911.08947, 2019

Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization.arXiv preprint arXiv:1911.08947, 2019

arXiv 1911
[49]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025

2025
[50]

Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

Pith/arXiv arXiv 2025
[51]

Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining, 2025

Xiaomi LLM-Core Team. Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining, 2025

2025
[52]

Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

Pith/arXiv arXiv 2025
[53]

Seedance 2.0: Advancing video generation for world complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, apr 2026

Pith/arXiv arXiv 2026
[54]

Freeman, and Taesung Park

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation, 2024

2024
[55]

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

2024
[56]

Consistency models, 2023

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. 64

2023
[57]

Simplifying, stabilizing and scaling continuous-time consistency models, 2025

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models, 2025

2025
[58]

Large scale diffusion distillation via score-regularized continuous-time consistency, 2026

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency, 2026

2026
[59]

Pai-bench: A compre- hensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, and Humphrey Shi. Pai-bench: A compre- hensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

arXiv 2025
[60]

Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Y...

2026
[61]

Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang 65 Zeng, Junjin Xiao, Xinyuan Chang, et al. Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

arXiv 2026
[62]

Gigaworld-0: World models as data engine to empower embodied ai, 2025

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. Gigaworld-0: World models as data engine to em...

2025
[63]

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21807–21818. IEEE, 2024

[64]

Alibaba Tongyi Lab. Wan 2.5: Open-source ai video generation with audio. https://wan.video, 2025. Official product page

[65]

Google DeepMind. Veo 3.1. https://deepmind.google/models/veo/, 2025. Official model page

[66]

Wow: Towards a world omniscient world model through embodied interaction, 2025

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou...

[67]

OpenAI. Sora 2 pro. https://platform.openai.com/docs/models/sora-2-pro, 2025. Official model documentation, accessed 2026-06-08

[68]

Unitree Robotics. Unifolm-wma-0: A world-model-action (wma) framework under the unifolm family. https://huggingface.co/unitreerobotics/UnifoLM-WMA-0-Base, 2025. Hugging Face model card, accessed June 2026

[69]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

[70]

Qwen Team. Qwen3.5-2B. https://huggingface.co/Qwen/Qwen3.5-2B, 2025. Model card and benchmark results. Accessed: 2026-06-10

[71]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

[72]

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025

[73]

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

[74]

xAI. Grok-1.5 Vision Preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2026-06-10

[75]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024

[76]

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

[77]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...

2026
[78]

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025

[79]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

2025
[80]

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

[81]

Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026

[82]

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Shuailei Ma, He Sun, Yong Wang, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Shuai Zhou, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Qian Zhu, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692v1, 2026
