
arxiv: 2606.16533 · v2 · pith:ELUIGAAZnew · submitted 2026-06-15 · 💻 cs.AI · cs.CV

Kairos: A Native World Model Stack for Physical AI

Pith reviewed 2026-06-27 03:59 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords world models · physical AI · embodied AI · temporal attention · hybrid attention · error bounds · pre-training curriculum · deployment co-design

The pith

Kairos introduces a world model stack that learns from mixed embodiment data and maintains states over long horizons with mathematically bounded error via hybrid attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kairos as a complete native stack for world models aimed at physical AI applications. It organizes pre-training around a curriculum that progresses from open-world videos to human behavior and robot interactions. A unified architecture combines world understanding, generation, and prediction through hybrid linear temporal attention that factors local, mid-range, and global dependencies. Formal bounds are stated to show this factorization strictly limits error accumulation while guaranteeing state propagation across extended time horizons. The system also includes deployment co-design for low-latency operation on varied hardware, and experiments report top-level results on embodied, long-horizon, and policy benchmarks alongside favorable efficiency trade-offs.

Core claim

Kairos pioneers a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum that sequences heterogeneous experience into a developmental pathway. It maintains the world through a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention handles local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention sustains global memory. Formal theoretical bounds demonstrate that this temporal factorization strictly limits error accumulation and mathematically guarantees state propagation across extended horizons. Deployment-aware system co-design enables low-latency rollout on server and consumer-grade hardware.

What carries the argument

Hybrid Linear Temporal Attention that combines sliding-window attention for local dynamics, dilated sliding windows for mid-range dependencies, and gated linear attention for persistent global memory, carrying the theoretical bounds on error accumulation.
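The three components named above can be sketched in a few lines of plain Python. This is a minimal illustration of the generic techniques (causal sliding window, dilated sliding window, scalar-gated linear attention), not the paper's implementation: the abstract gives no equations, so the function names, the window and dilation parameters, and the scalar form of the gate are all assumptions made here.

```python
def sliding_window_indices(t, window):
    """Positions a causal sliding window of size `window` attends to at step t."""
    return list(range(max(0, t - window + 1), t + 1))


def dilated_window_indices(t, window, dilation):
    """Causal window with `window` taps spaced `dilation` steps apart, ending at t."""
    taps = [t - k * dilation for k in range(window)]
    return sorted(i for i in taps if i >= 0)


def gated_linear_attention(queries, keys, values, gates):
    """Scalar-gated linear attention (assumed form).

    Recurrence: S_t = g_t * S_{t-1} + k_t v_t^T, output o_t = q_t^T S_t.
    The state S has fixed size (d_k x d_v), so memory does not grow with t.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


local = sliding_window_indices(10, window=3)                  # nearest frames
mid_range = dilated_window_indices(10, window=3, dilation=4)  # sparser, wider reach
```

The two index functions show why the windows cover different ranges at the same cost per step: both return at most `window` positions, but the dilated variant spreads them over `window * dilation` steps. The recurrent state in `gated_linear_attention` is what would carry information beyond either window.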

If this is right

  • Enables low-latency observation-action-feedback loops on consumer-grade hardware.
  • Organizes open-world videos, human data, and robot interactions into a single progressive training pathway.
  • Delivers top-level results on embodied world-model and long-horizon benchmarks while preserving efficiency.
  • Supplies mathematical guarantees for state propagation that support extended physical AI operation.
  • Forms an operational foundation for future self-evolving physical intelligence systems.
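The progressive training pathway in the list above can be pictured as a stage-dependent sampling mixture over the three data sources. The sketch below is purely illustrative: the stage boundaries, the mixture weights, and the names `CURRICULUM`, `mixture_at`, and `sample_source` are hypothetical, since the abstract does not specify how the curriculum is scheduled.

```python
import random

# Hypothetical schedule: (end of stage as a fraction of training, sampling weights).
CURRICULUM = [
    (0.5, {"open_world_video": 1.0, "human_behavior": 0.0, "robot_interaction": 0.0}),
    (0.8, {"open_world_video": 0.5, "human_behavior": 0.5, "robot_interaction": 0.0}),
    (1.0, {"open_world_video": 0.2, "human_behavior": 0.3, "robot_interaction": 0.5}),
]


def mixture_at(progress):
    """Return the sampling weights for a training progress value in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    for stage_end, weights in CURRICULUM:
        if progress <= stage_end:
            return weights
    return CURRICULUM[-1][1]


def sample_source(progress, rng):
    """Draw one data source name according to the current stage's weights."""
    weights = mixture_at(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
early = sample_source(0.1, rng)  # only open-world video has nonzero weight here
```

A stepwise schedule is only one possible reading of "progressive"; a smooth interpolation between mixtures would fit the description equally well.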

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The error-bound approach could reduce reliance on frequent model resets or heavy retraining in deployed robotic systems.
  • Integration with existing reinforcement learning loops might improve sample efficiency in policy learning without extra compute scaling.
  • Testing the curriculum on additional data modalities could reveal whether the bounds hold when embodiment gaps widen further.

Load-bearing premise

The Hybrid Linear Temporal Attention mechanism with sliding-window, dilated, and gated linear components produces the claimed strict limit on error accumulation and state propagation guarantees.

What would settle it

A controlled long-horizon rollout test on an embodied benchmark that measures whether prediction error exceeds the theoretical bound derived from the temporal factorization.
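Such a test reduces to comparing a measured per-step error curve against the bound evaluated at the same steps. The sketch below shows only that comparison. The bound used here, `geometric_bound`, is a hypothetical placeholder with assumed parameters `eps` and `gamma`; the paper's actual bound is not given in the abstract, and the error values are made-up numbers for illustration.

```python
def geometric_bound(step, eps, gamma):
    """Hypothetical bound eps * (1 - gamma**step) / (1 - gamma), for step >= 1."""
    return eps * (1.0 - gamma ** step) / (1.0 - gamma)


def bound_violations(errors, bound_fn):
    """Return (step, measured_error, bound) for each step whose error exceeds the bound."""
    violations = []
    for step, err in enumerate(errors, start=1):
        limit = bound_fn(step)
        if err > limit:
            violations.append((step, err, limit))
    return violations


measured = [0.05, 0.12, 0.30]  # placeholder per-step rollout errors, not real data
violations = bound_violations(
    measured, lambda s: geometric_bound(s, eps=0.1, gamma=0.5)
)
```

A single violation on a rollout that satisfies the bound's stated assumptions would contradict the guarantee; an empty list is consistent with it but does not prove it.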

Original abstract

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unified world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Kairos, a native world model stack for Physical AI. It features (1) a Cross-Embodiment Data Curriculum for native pre-training on open-world videos, human data, and robot interactions; (2) a Native Unified Architecture with Hybrid Linear Temporal Attention (sliding-window for local dynamics, dilated windows for mid-range, gated linear for global memory) that claims formal theoretical bounds strictly limiting error accumulation and guaranteeing state propagation over long horizons; and (3) Deployment-Aware System Co-Design for low-latency rollouts. Experiments claim top-level performance with strong efficiency trade-offs on embodied world-model, long-horizon, and action-policy benchmarks.

Significance. If the formal bounds are rigorously derived and the benchmark results hold with proper controls and baselines, the work would be significant for providing an integrated, deployment-ready foundation for physical AI that addresses long-horizon state maintenance and efficiency, moving beyond passive visual world models.

major comments (2)
  1. [Abstract] Abstract: the assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and 'mathematically guarantees state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how sliding-window + dilated + gated linear attention produces these properties rather than standard attention; this is load-bearing for the central theoretical claim.
  2. [Abstract] Abstract (and Experiments section): claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, preventing assessment of whether superiority is demonstrated or if results reduce to unstated fitting.
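To make concrete what kind of statement the first comment asks for, the following worked bound shows how the assumptions the referee lists could yield a non-accumulating error for a gated linear recurrence. It is an illustration under assumptions chosen here, not the paper's derivation: the scalar gate $g_t$ with $0 \le g_t \le \gamma < 1$, the per-step perturbation $\delta_t$ with $\|\delta_t\| \le \varepsilon$, and the input term $u_t$ are all hypothetical.

```latex
% Illustrative only: a scalar-gated linear recurrence, not the paper's Section 4 result.
% Assumptions: 0 <= g_t <= \gamma < 1 and \|\delta_t\| <= \varepsilon for all t.
\begin{align*}
S_t &= g_t S_{t-1} + u_t, \qquad
\tilde S_t = g_t \tilde S_{t-1} + u_t + \delta_t, \\
e_t &:= \tilde S_t - S_t = g_t e_{t-1} + \delta_t, \qquad e_0 = 0, \\
\|e_t\| &\le \gamma \|e_{t-1}\| + \varepsilon
  \;\le\; \varepsilon \sum_{k=0}^{t-1} \gamma^{k}
  \;=\; \varepsilon\,\frac{1-\gamma^{t}}{1-\gamma}
  \;\le\; \frac{\varepsilon}{1-\gamma}.
\end{align*}
```

Under these assumptions the error stays below $\varepsilon/(1-\gamma)$ at every horizon. The same contraction also discounts old state by at most $\gamma^{t}$, so a derivation of this type would need to say how error control and long-horizon state propagation are obtained together.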

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and 'mathematically guarantees state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how sliding-window + dilated + gated linear attention produces these properties rather than standard attention; this is load-bearing for the central theoretical claim.

    Authors: We agree the abstract states the claim at a high level without derivation details. The full manuscript contains the rigorous derivation in Section 4, including assumptions (bounded state norms, Lipschitz properties of the gating function), the error-bound proof for the hybrid factorization versus standard attention, and the state-propagation guarantee over long horizons. To address the concern, we will revise the abstract to briefly note the key assumptions and reference Section 4 for the complete analysis and proof sketch.

    Revision: partial

  2. Referee: [Abstract] Abstract (and Experiments section): claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, preventing assessment of whether superiority is demonstrated or if results reduce to unstated fitting.

    Authors: The abstract summarizes high-level outcomes; all requested details (specific metrics, baselines, error bars, dataset sizes, and statistical tests) appear in Section 5 with Tables 1–4 and Figures 3–6. We will revise the abstract to include one or two key quantitative results (e.g., relative gains and efficiency metrics) for improved clarity while preserving length constraints. No changes are required in the experiments section.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical bounds asserted independently of inputs

Full rationale

The paper claims to establish formal theoretical bounds showing that the Hybrid Linear Temporal Attention factorization strictly limits error accumulation and guarantees state propagation. The provided text contains no equations, no derivation steps, no fitted parameters renamed as predictions, and no self-citations used to justify the bounds. With no step available that reduces the claimed result to its own inputs by construction, no circularity can be flagged. This is the expected honest non-finding when no load-bearing circular step can be quoted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; full manuscript required for ledger construction.

pith-pipeline@v0.9.1-grok · 5850 in / 1089 out tokens · 68707 ms · 2026-06-27T03:59:15.574766+00:00 · methodology


Reference graph

Works this paper leans on

188 extracted references · 58 linked inside Pith

  1. [1]

    Learning to model the world: A survey of world models in artificial intelligence.TechRxiv, 2026

    Jiahua Dong, Qi Lyu, Baichen Liu, Xudong Wang, Wenqi Liang, Duzhen Zhang, Jiahang Tu, Hongliu Li, Hanbin Zhao, Henghui Ding, Yulun Zhang, Zhi Han, Nicu Sebe, Fahad Shahbaz Khan, Salman Khan, Mubarak Shah, Philip Torr, Ming-Hsuan Yang, and Dacheng Tao. Learning to model the world: A survey of world models in artificial intelligence.TechRxiv, 2026

  2. [2]

    NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin...

  3. [3]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

    Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

  4. [4]

    V-jepa 2.1: Unlocking dense features in video self-supervised learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

  5. [5]

    Back to the features: Dino as a foundation for video world models, 2025

    Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, and Piotr Bojanowski. Back to the features: Dino as a foundation for video world models, 2025

  6. [6]

    Marble: A multimodal world model.https://marble.worldlabs.ai/, 2025

    World Labs. Marble: A multimodal world model.https://marble.worldlabs.ai/, 2025

  7. [7]

    Teleworld: Towards dynamic multimodal synthesis with a 4d world model, 2025

    Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan’er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, and Xuelong Li. Teleworld: T...

  8. [8]

    Genie 3: A new frontier for world models

    Google DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/blog/ genie-3-a-new-frontier-for-world-models/, 2025

  9. [9]

    Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

    Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency.arXiv preprint, 2025

  10. [10]

    Lingbot-world: An interactive world model for embodied intelligence.arXiv preprint, January 2026

    Robbyant Team. Lingbot-world: An interactive world model for embodied intelligence.arXiv preprint, January 2026. Ant Group Robbyant Technology

  11. [11]

    Training agents inside of scalable world models, 2025

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025

  12. [12]

    Worldmodelbench: Judging video generation models as world models

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong 61 Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694, 2025

  13. [13]

    Dreamgen: Unlocking generalization in robot learning through neural trajectories.arXiv preprint arXiv:2505.12705, 2025

    Jim Fan, Yoel Jang, Ireayo Akinola, et al. Dreamgen: Unlocking generalization in robot learning through neural trajectories.arXiv preprint arXiv:2505.12705, 2025. Introduces DreamGen Bench, a video generation benchmark for robot learning

  14. [14]

    Qwen2.5-vl, January 2025

    Qwen Team. Qwen2.5-vl, January 2025

  15. [15]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  16. [16]

    Ace-brain-0: Spatial intelligence as a shared scaffold for universal embodiments, 2026

    Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, and Xiaogang Wang. Ace-brain-0: Spatial intelligence as a shared scaffold for universal emb...

  17. [18]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In The Thirteenth International Conference on Learning Representations

  18. [19]

    AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialun Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yue Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng-Xing Ruan, Jiaqi Shan, Yongjian Shen, Ch...

  19. [20]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

  20. [21]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

  21. [22]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023. 62

  22. [23]

    Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

  23. [25]

    Open-sora 2.0: Training a commercial-level video generation model in $200k.arXiv preprint arXiv:2503.09642, 2025

    Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k.arXiv preprint arXiv:2503.09642, 2025

  24. [26]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  25. [27]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

  26. [28]

    Fusionbench: A comprehensive benchmark of deep model fusion.Journal of MachineLearning Research, 2025

    Anke Tang, Li Shen, Yong Luo, Enneng Yang, Han Hu, Lefei Zhang, Bo Du, and Dacheng Tao. Fusionbench: A comprehensive benchmark of deep model fusion.Journal of MachineLearning Research, 2025

  27. [29]

    Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities.ACM Comput

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities.ACM Comput. Surv., 58(8), February 2026

  28. [30]

    Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022

  29. [31]

    Revisiting weight averaging for model merging

    Jiho Choi, Donggyun Kim, Chanhyuk Lee, and Seunghoon Hong. Revisiting weight averaging for model merging. arXiv preprint arXiv:2412.12153, 2024

  30. [32]

    Ties-merging: Resolving interference when merging models, 2023

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models, 2023

  31. [33]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024

  32. [34]

    Whoever started the interference should end it: Guiding data-free model merging via task vectors, 2025

    Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors, 2025

  33. [35]

    Diffusion model alignment using direct preference optimization, 2023

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023

  34. [36]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

  35. [37]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025. 63

  36. [38]

    Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023

  37. [39]

    Ring attention with blockwise transformers for near-infinite context, 2023

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023

  38. [40]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  39. [41]

    Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

    Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, and Siyu Zhu. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2025

  40. [42]

    Vidgen-1m: A large-scale dataset for text-to-video generation

    Zirui Tan, Yandong Li, Yaliang Li, and Jingren Zhou. Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024

  41. [43]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision (ECCV), pages 402–419. Springer, 2020

  42. [44]

    Fine-tuned vision transformer for nsfw image classification

    FalconsAI Team. Fine-tuned vision transformer for nsfw image classification. Hugging Face Model Hub,

  43. [45]

    Initial commit 2023-10-14, Last updated 2025-04-06, Apache-2.0 License, 80k training images, 98.04% accuracy, 85.8M params

  44. [46]

    Yolox: Exceeding yolo series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021

  45. [47]

    Bytetrack: Multi-object tracking by associating every detection box.arXiv preprint arXiv:2110.06864, 2021

    YifuZhang, PeizeSun, YiJiang, DongdongYu, ZehuanYuan, PingLuo, WenyuLiu, andXinggangWang. Bytetrack: Multi-object tracking by associating every detection box.arXiv preprint arXiv:2110.06864, 2021

  46. [48]

    Real-time scene text detection with differentiable binarization.arXiv preprint arXiv:1911.08947, 2019

    Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. Real-time scene text detection with differentiable binarization.arXiv preprint arXiv:1911.08947, 2019

  47. [49]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  48. [50]

    Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  49. [51]

    Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining, 2025

    Xiaomi LLM-Core Team. Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining, 2025

  50. [52]

    Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

  51. [53]

    Seedance 2.0: Advancing video generation for world complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, apr 2026

  52. [54]

    Freeman, and Taesung Park

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation, 2024

  53. [55]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

  54. [56]

    Consistency models, 2023

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. 64

  55. [57]

    Simplifying, stabilizing and scaling continuous-time consistency models, 2025

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models, 2025

  56. [58]

    Large scale diffusion distillation via score-regularized continuous-time consistency, 2026

    Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency, 2026

  57. [59]

    Pai-bench: A compre- hensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

    Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, and Humphrey Shi. Pai-bench: A compre- hensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

  58. [60]

    Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Y...

  59. [61]

    Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

    Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang 65 Zeng, Junjin Xiao, Xinyuan Chang, et al. Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

  60. [62]

    Gigaworld-0: World models as data engine to empower embodied ai, 2025

    GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. Gigaworld-0: World models as data engine to em...

  61. [63]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21807–21818. IEEE, 2024

  62. [64]

    Wan 2.5: Open-source ai video generation with audio.https://wan.video, 2025

    Alibaba Tongyi Lab. Wan 2.5: Open-source ai video generation with audio.https://wan.video, 2025. Official product page

  63. [65]

    Veo 3.1.https://deepmind.google/models/veo/, 2025

    Google DeepMind. Veo 3.1.https://deepmind.google/models/veo/, 2025. Official model page

  64. [66]

    Wow: Towards a world omniscient world model through embodied interaction, 2025

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou...

  65. [67]

    Sora 2 pro.https://platform.openai.com/docs/models/sora-2-pro, 2025

    OpenAI. Sora 2 pro.https://platform.openai.com/docs/models/sora-2-pro, 2025. Official model documentation, accessed 2026-06-08

  66. [68]

    Unifolm-wma-0: A world-model-action (wma) framework under the unifolm family

    Unitree Robotics. Unifolm-wma-0: A world-model-action (wma) framework under the unifolm family. https://huggingface.co/unitreerobotics/UnifoLM-WMA-0-Base, 2025. Hugging Face model card, accessed June 2026

  67. [69]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  68. [70]

    Qwen3.5-2B

    Qwen Team. Qwen3.5-2B. https://huggingface.co/Qwen/Qwen3.5-2B, 2025. Model card and benchmark results. Accessed: 2026-06-10

  69. [71]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  70. [72]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025

  71. [73]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

  72. [74]

    Grok-1.5 Vision Preview. https://x.ai/blog/grok-1.5v, 2024

    xAI. Grok-1.5 Vision Preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2026-06-10

  73. [75]

    Are we on the right way for evaluating large vision-language models?, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024

  74. [76]

    Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  75. [77]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...

  76. [78]

    X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025

  77. [79]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  78. [80]

    Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

    StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

  79. [81]

    Abot-m0: Vla foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026

    Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning. arXiv preprint arXiv:2602.11236, 2026

  80. [82]

    A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692v1, 2026

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Shuailei Ma, He Sun, Yong Wang, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Shuai Zhou, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Qian Zhu, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692v1, 2026
