Cosmos 3: Omnimodal World Models for Physical AI

Aarti Basant; Adeline Aubame; Aigul Dzhumamuratova; Akash Gokul; Aleksandr Efitorov; Alexander Sotelo; Alice Luo; Ali Hassani; Alisson Azzolini; Alperen Degirmenci

arxiv: 2606.02800 · v4 · pith:B43Y2PK5new · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.RO


NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai


Maciej Bala Yogesh Balaji Josh Bapst Aarti Basant Mukesh Beladiya Mohammad Qazim Bhat Zaid Pervaiz Bhat Dan Blick Vanni Brighella Han Cai Tiffany Cai Eric Cameracci Jiaxin Cao Yulong Cao Mark Carlson Carlos Casanova Ting-Yun Chang Yan Chang Yu-Wei Chao Prithvijit Chattopadhyay Roshan Chaudhari Chieh-Yun Chen Junyu Chen Ke Chen Qizhi Chen Wenkai Chen Xiaotong Chen Yu Chen An-Chieh Cheng Click Cheng Xiu Chia Jeana Choi Chaeyeon Chung Wenyan Cong Yin Cui Magdalena Dadela Nalin Dadhich Wenliang Dai Joyjit Daw Alperen Degirmenci Rodrigo Vieira Del Monte Robert Denomme Sameer Dharur Marco Di Lucca Ke Ding Wenhao Ding Yifan Ding Yuzhu Dong Nicole Drumheller Yilun Du Aigul Dzhumamuratova Aleksandr Efitorov Hamid Eghbalzadeh Naomi Eigbe Imad El Hanafi Hassan Eslami Benedikt Falk Jiaojiao Fan Jim Fan Amol Fasale Sergiy Fefilatyev Liang Feng Francesco Ferroni Sanja Fidler Xiao Fu Vikram Fugro Prashant Gaikwad TJ Galda Katelyn Gao Yihuai Gao Wenhang Ge Sreyan Ghosh Arushi Goel Vivek Goel Akash Gokul Rama Govindaraju Jinwei Gu Miguel Guerrero Elfie Guo Aryaman Gupta Siddharth Gururani Hugo Hadfield Song Han Ankur Handa Zekun Hao Mohammad Harrim Ali Hassani Nathan Hayes-Roth Yufan He Chris Helvig Cyrus Hogg Madison Huang Michael Huang Sophia Huang Yufan Huang Jacob Huffman DeLesley Hutchins Suneel Indupuru Boris Ivanovic Arihant Jain Joel Jang Ryan Ji Yanan Jian Dongfu Jiang Jingyi Jin Atharva Joshi Nikhilesh Joshi Pranjali Joshi Andy Ju Jaehun Jung Weiwei Kang Scott Kassekert Jan Kautz Ashna Khetan Julia Kiczka Slawek Kierat Gwanghyun Kim Kuno Kim Sunny Kim Kezhi Kong Xin Kong Zhifeng Kong Tomasz Kornuta Egor Krivov Hui Kuang Saurav Kumar Chia-Wen Kuo George Kurian Wojciech Kutak JF Lafleche Himangshu Lahkar Omar Laymoun Jayjun Lee Sanggil Lee Gabriele Leone Boyi Li Freya Li Jiajun Li Jinfeng Li Ling Li Pengcheng Li Shangru Li Tingle Li Xiaolong Li Xuan Li Zhaoshuo Li Zhiqi Li Hao Liang Maosheng Liao Chen-Hsuan Lin Tsung-Yi Lin Ming-Yu Liu Sifei Liu Zihan Liu Hai Loc Lu Xiangyu 
Lu Alice Luo Ruipu Luo Wenjie Luo Jiangran Lyu Martin Ding Ma Nic Ma Qianli Ma Dawid Majchrowski Louis Marcoux Miguel Martin Qing Miao Ashkan Mirzaei Shreyas Misra Kaichun Mo Durra Mohsin Hyejin Moon Pawel Morkisz Saeid Motiian Kirill Motkov Seungjun Nah Yashraj Narang Deepak Narayanan Thabang Ngazimbi Julian Ouyang Shubham Pachori David Page Yatian Pang Sehwi Park Mahesh Patekar Mostofa Patwary Marco Pavone Trung Pham Wei Ping Soha Pouya Shrimai Prabhumoye Varun Praveen Delin Qu Hesam Rabeti Morteza Ramezanali Marilyn Reeb Xuanchi Ren Kristen Rumley Wojciech Rymer Jun Saito Yeongho Seol John Shao Piyush Shekdar Tianwei Shen Humphrey Shi Min Shi Stella Shi Kevin Shih Mohammad Shoeybi Mateusz Sieniawski Shuran Song Alexander Sotelo Amir Sotoodeh Sunil Srinivasa Vignesh Srinivasakumar Bartosz Stefaniak Rahul Heinrich Steiger Shangkun Sun Jiaxiang Tang Shitao Tang Yangyang Tang Yue Tang Tolou Tavakkoli Kayley Ting Krzysztof Tomala Wei-Cheng Tseng Jibin Varghese Sergei Vasilev Thomas Volk Raju Wagwani Roger Waleffe Andrew Z. Wang Boxiang Wang Haoxiang Wang Qiao Wang Shihao Wang Shijie Wang Ting-Chun Wang Yan Wang Yu Wang Rohit Watve David Wehr Fangyin Wei Xinshuo Weng Jay Zhangjie Wu Kedi Wu Hongchi Xia Summer Xiao Tianjun Xiao Kevin Xie Daguang Xu Jiashu Xu Mengyao Xu Ruqing Xu Xingqian Xu Yao Xu Dinghao Yang Dong Yang Hans Yang Xiaodong Yang Xuning Yang Yichu Yang Yurong You Zhiding Yu Hao Yuan Simon Yuen Xiaohui Zeng Pengcuo Zeren Cindy Zha Haotian Zhang Jenny Zhang Jing Zhang Liangkai Zhang Paris Zhang Shun Zhang Xuanmeng Zhang Zhizheng Zhang Ann Zhao Yilin Zhao Yuliya Zhautouskaya Charles Zhou Fengzhe Zhou Shilin Zhu Yuke Zhu Dima Zhylko Artur Zolkowski


classification 💻 cs.CV cs.AI cs.LG cs.MM cs.RO

keywords models cosmos world https nvidia omnimodal physical available



We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

This paper has not been read by Pith yet.


Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models
cs.CV 2026-06 unverdicted novelty 6.0

Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only s...

DiffusionBench: On Holistic Evaluation of Diffusion Transformers
cs.CV 2026-06 conditional novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
cs.CV 2026-06 unverdicted novelty 6.0

ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
cs.RO 2026-06 unverdicted novelty 6.0

SC3-Eval enforces three consistency constraints on video world models to evaluate robot manipulation policies, achieving 0.929 Pearson correlation with real-world rollouts across seven policies.

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
cs.CV 2026-06 unverdicted novelty 6.0

ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated in...

Learning Action Priors for Cross-embodiment Robot Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

A two-stage framework pretrains an action module with temporal motion priors from unconditioned trajectories using flow-matching, then transfers it to VLA training via decoder reuse and distillation, yielding better p...

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation
cs.CV 2026-06 unverdicted novelty 5.0

Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three...

Physics-IQ Verified
cs.CV 2026-06 unverdicted novelty 5.0

Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
cs.AI 2026-06 unverdicted novelty 5.0

Geometry-led weighting outperforms blended memory recall for spatial queries, and a DDA-based visibility predicate correctly flags occluded targets while recall remains occlusion-blind.

Critique of Agent Model
cs.AI 2026-06 unverdicted novelty 4.0

Distinguishes agentic (externally scaffolded) from agentive (internally structured) AI systems and proposes the Goal-Identity-Configurator architecture for endogenous autonomy.