Cosmos 3: Omnimodal World Models for Physical AI
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.
Forward citations
Cited by 11 Pith papers
- Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models
  Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only s...
- DiffusionBench: On Holistic Evaluation of Diffusion Transformers
  NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.
- ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
  ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
- SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
  SC3-Eval enforces three consistency constraints on video world models to evaluate robot manipulation policies, achieving 0.929 Pearson correlation with real-world rollouts across seven policies.
- ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
  ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated in...
- Learning Action Priors for Cross-embodiment Robot Manipulation
  A two-stage framework pretrains an action module with temporal motion priors from unconditioned trajectories using flow-matching, then transfers it to VLA training via decoder reuse and distillation, yielding better p...
- Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation
  Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three...
- Physics-IQ Verified
  Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.
- PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
  PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
- What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
  Geometry-led weighting outperforms blended memory recall for spatial queries, and a DDA-based visibility predicate correctly flags occluded targets while recall remains occlusion-blind.
- Critique of Agent Model
  Distinguishes agentic (externally scaffolded) from agentive (internally structured) AI systems and proposes the Goal-Identity-Configurator architecture for endogenous autonomy.