Cosmos 3: Omnimodal World Models for Physical AI

2026 · cs.CV · arXiv 2606.02800

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting flexible input-output configurations, Cosmos 3 unifies the critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation shows that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, supporting omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

representative citing papers

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.

ROSA: A Robotics Foundation Model Serving System for Robot Factories

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.

Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

SC3-Eval enforces three consistencies on a video model to produce policy rollouts that correlate 0.929 with real-world performance across seven vision-language-action policies and reproduce observed failure modes.

World Narrative Model for Highly Controllable Video Generation: A Paradigm Shift from Pixel Sampling to Physical World Orchestration

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

WNM introduces a 4D world narrative representation orchestrated by agents to drive video foundation models for high controllability.

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

cs.CV · 2026-06-26 · unverdicted · novelty 5.0

PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates from 16% to 24%.

citing papers explorer

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis
cs.RO · 2026-06-22 · unverdicted · none · ref 7 · internal anchor

ROSA: A Robotics Foundation Model Serving System for Robot Factories
cs.RO · 2026-07-01 · unverdicted · none · ref 1 · internal anchor

Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers
cs.CV · 2026-06-27 · unverdicted · none · ref 20 · internal anchor

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
cs.RO · 2026-06-17 · unverdicted · none · ref 23 · internal anchor

World Narrative Model for Highly Controllable Video Generation: A Paradigm Shift from Pixel Sampling to Physical World Orchestration
cs.CV · 2026-06-30 · unverdicted · none · ref 20 · internal anchor

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
cs.CV · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
