pith. machine review for the scientific record.

arxiv: 2601.16163 · v1 · submitted 2026-01-22 · 💻 cs.AI · cs.RO

Recognition: 2 theorem links · Lean Theorem

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 14:43 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords video generation models · visuomotor control · robot policy · fine-tuning · model-based planning · bimanual manipulation · latent diffusion

The pith

Cosmos Policy adapts a pretrained video model into a robot policy through single-stage fine-tuning by encoding actions as latent frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a large video generation model can be turned into a capable robot controller and planner with only one round of fine-tuning on target-platform demonstrations and no new architectural pieces. Robot actions are generated by treating them as latent frames inside the model's existing diffusion process, which lets the system reuse its built-in knowledge of how scenes and objects evolve over time. The same mechanism also produces predicted future images and expected reward values, supporting test-time planning that selects higher-success action sequences. This yields top results on standard simulation suites and challenging real bimanual tasks while beating both scratch-trained diffusion policies and fine-tuned vision-language-action models.

Core claim

Cosmos Policy is a single-stage fine-tuning procedure that encodes robot actions, future state images, and value estimates as latent frames within a pretrained video model's latent diffusion process, thereby converting the model's spatiotemporal priors into a visuomotor policy that supports both direct action generation and model-based planning.

What carries the argument

Encoding robot actions, future states, and values as latent frames inside the video model's latent diffusion process, using its existing training algorithm with no architectural modifications.
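
To make this mechanism concrete: below is a minimal sketch, assuming a hypothetical latent geometry and zero-padding scheme, of how an action chunk and a scalar value could be packed into extra latent "frames" so that a single standard video diffusion objective trains over images, actions, and values jointly. It is an editorial illustration, not the authors' implementation.

```python
import torch

# Assumed latent geometry: the video tokenizer emits frames of shape (C, H, W);
# actions and values are padded and reshaped to the same shape so they can ride
# along as additional "frames" in the diffusion sequence.
C, H, W = 16, 8, 8              # hypothetical latent channels and spatial size
FRAME_NUMEL = C * H * W

def pack_as_latent_frame(vec: torch.Tensor) -> torch.Tensor:
    """Zero-pad a flat vector (an action chunk or a scalar value) and reshape
    it to the same shape as one latent video frame."""
    padded = torch.zeros(FRAME_NUMEL)
    padded[: vec.numel()] = vec
    return padded.view(C, H, W)

def build_diffusion_sequence(image_latents: torch.Tensor,
                             action_chunk: torch.Tensor,
                             value: torch.Tensor) -> torch.Tensor:
    """Concatenate observed image latents with action and value frames along
    the time axis. The resulting (T, C, H, W) sequence is what an ordinary
    video diffusion loss would train on, with the extra frames noised and
    denoised like any other frame."""
    action_frame = pack_as_latent_frame(action_chunk.flatten()).unsqueeze(0)
    value_frame = pack_as_latent_frame(value.reshape(1)).unsqueeze(0)
    return torch.cat([image_latents, action_frame, value_frame], dim=0)

# Example: two observed latent frames, a 16-step 7-DoF action chunk, one value.
obs = torch.randn(2, C, H, W)
actions = torch.randn(16, 7)
value = torch.tensor(0.8)
print(build_diffusion_sequence(obs, actions, value).shape)  # torch.Size([4, 16, 8, 8])
```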

If this is right

  • Delivers 98.5 percent average success on the LIBERO benchmark and 67.1 percent on RoboCasa.
  • Outperforms strong diffusion policies, other video-based policies, and state-of-the-art vision-language-action models on the same robot data.
  • Supports iterative improvement by retraining on its own rollout data to refine the world model and value estimates.
  • Enables test-time planning that selects action trajectories with higher predicted success likelihood (a selection sketch follows this list).
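
As a rough illustration of the planning bullet above, the sketch below draws several candidate trajectories and keeps the one whose decoded value frame predicts the highest return. The `sample` interface and the stub policy are hypothetical stand-ins; the paper's planner operates inside the latent diffusion process itself.

```python
import torch

def plan_with_value(policy, observation, num_candidates: int = 8):
    """Hypothetical test-time planner: sample several action trajectories,
    read the value each rollout predicts for itself, and keep the best one."""
    best_actions, best_value = None, float("-inf")
    for _ in range(num_candidates):
        # Assumed interface: one sampling pass returns an action chunk plus
        # the value latent decoded to a scalar expected return.
        actions, predicted_value = policy.sample(observation)
        if predicted_value > best_value:
            best_actions, best_value = actions, predicted_value
    return best_actions, best_value

class StubPolicy:
    """Stand-in for the fine-tuned video model; returns random candidates."""
    def sample(self, observation):
        return torch.randn(16, 7), torch.rand(()).item()

actions, value = plan_with_value(StubPolicy(), observation=None)
print(actions.shape, round(value, 3))
```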

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General video priors could lower the data volume needed to achieve competent robot behavior on new hardware.
  • The latent-frame encoding approach might be tested on domains such as navigation or manipulation with different robot morphologies.
  • If the method scales, it would allow rapid policy adaptation to new environments by collecting modest target demonstrations rather than training entirely new models.

Load-bearing premise

That the video model's pretrained understanding of physical scene dynamics transfers to robot control simply by representing actions and rewards as additional latent frames during fine-tuning.

What would settle it

A new manipulation task suite where a diffusion policy trained from scratch on the same demonstrations matches or exceeds Cosmos Policy's success rate after identical single-stage training.

read the original abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Cosmos Policy, which adapts the pretrained Cosmos-Predict2 video generation model into a visuomotor policy through single-stage fine-tuning on target-platform robot demonstrations. Actions, future states, and scalar value estimates (expected cumulative rewards) are encoded directly as latent frames in the model's latent diffusion process, with no architectural modifications or auxiliary components. This enables both direct policy execution and test-time planning over action trajectories. The work reports state-of-the-art average success rates of 98.5% on LIBERO and 67.1% on RoboCasa simulation benchmarks, plus the highest average score on challenging real-world bimanual manipulation tasks, outperforming diffusion policies trained from scratch, other video-model policies, and fine-tuned vision-language-action models. It further shows that rollout data can be used to refine the world model and value function for improved planning.

Significance. If the empirical claims hold under rigorous validation, the result would demonstrate that large-scale pretrained video models can transfer spatiotemporal priors to robotics with unusually low adaptation overhead, simplifying policy-learning pipelines that currently rely on multi-stage training or custom action heads. The explicit release of code, models, and training data is a clear strength that supports reproducibility and community follow-up. The iterative refinement capability from experience data adds practical value for long-horizon tasks.

major comments (3)
  1. [§3.2] §3.2 (Action, State, and Value Encoding): The central claim that continuous actions and scalar values can be mapped into the visual latent space of Cosmos-Predict2 without dedicated decoders or losses rests on the unexamined assumption that reconstruction fidelity remains sufficient for precise control. No quantitative metrics (e.g., action reconstruction MSE, distribution divergence, or ablation against explicit action heads) are reported, which directly bears on whether single-stage fine-tuning alone suffices.
  2. [Table 2] Table 2 (Real-world bimanual results): The reported highest average score is load-bearing for the 'no extra components' assertion, yet the evaluation protocol (number of trials per task, variance across seeds, and statistical comparison to baselines) is not detailed. Without these, it is impossible to assess whether the gains are attributable to the latent-frame encoding rather than implementation details or evaluation variance.
  3. [§5.1] §5.1 (Learning from Experience): The iterative refinement experiment shows improved success rates after incorporating rollout data, but lacks an ablation isolating the contribution of the value-function latent frames versus the world-model component. This weakens the claim that the unified latent-frame formulation uniquely enables effective model-based planning.
minor comments (2)
  1. [Abstract] The abstract states average success rates without specifying the number of tasks or environments over which the averages are computed; this should be clarified for interpretability.
  2. [Figure 3] Figure 3 (qualitative rollouts) would benefit from explicit annotation of the encoded action frames versus predicted future states to help readers trace the latent-frame pipeline.
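
Major comment 1 asks for direct evidence that the latent-frame encoding preserves action fidelity. As an illustration only, the sketch below computes an action reconstruction MSE by round-tripping demonstration actions through hypothetical `encode_action` / `decode_action` functions; it is not the authors' evaluation code.

```python
import torch

def action_reconstruction_mse(encode_action, decode_action, demo_actions):
    """Round-trip each demonstration action chunk through a latent-frame
    encoding and report the mean squared reconstruction error, the kind of
    fidelity metric requested for Section 3.2."""
    errors = []
    for actions in demo_actions:                 # each chunk: (steps, action_dim)
        latent_frame = encode_action(actions)    # hypothetical encoder
        recovered = decode_action(latent_frame)  # hypothetical decoder
        errors.append(torch.mean((actions - recovered) ** 2))
    return torch.stack(errors).mean()

# Example with identity placeholders (a real test would round-trip through the
# video model's tokenizer): perfect reconstruction gives zero error.
demos = [torch.randn(16, 7) for _ in range(10)]
print(action_reconstruction_mse(lambda a: a, lambda z: z, demos))  # tensor(0.)
```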

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Action, State, and Value Encoding): The central claim that continuous actions and scalar values can be mapped into the visual latent space of Cosmos-Predict2 without dedicated decoders or losses rests on the unexamined assumption that reconstruction fidelity remains sufficient for precise control. No quantitative metrics (e.g., action reconstruction MSE, distribution divergence, or ablation against explicit action heads) are reported, which directly bears on whether single-stage fine-tuning alone suffices.

    Authors: We appreciate this observation on the need for direct evidence of encoding quality. The state-of-the-art results on LIBERO, RoboCasa, and real-world tasks provide indirect validation that the latent-frame mapping supports precise control. To address the concern explicitly, we will add quantitative metrics including action reconstruction MSE, KL divergence on action distributions, and a limited comparison to an explicit action-head baseline in the revised §3.2. revision: yes

  2. Referee: [Table 2] Table 2 (Real-world bimanual results): The reported highest average score is load-bearing for the 'no extra components' assertion, yet the evaluation protocol (number of trials per task, variance across seeds, and statistical comparison to baselines) is not detailed. Without these, it is impossible to assess whether the gains are attributable to the latent-frame encoding rather than implementation details or evaluation variance.

    Authors: We agree that fuller reporting of the evaluation protocol is necessary for rigorous assessment. In the revised manuscript we will expand the caption and surrounding text for Table 2 to specify the number of trials per task, standard deviations across random seeds, and the statistical methods used for baseline comparisons. revision: yes

  3. Referee: [§5.1] §5.1 (Learning from Experience): The iterative refinement experiment shows improved success rates after incorporating rollout data, but lacks an ablation isolating the contribution of the value-function latent frames versus the world-model component. This weakens the claim that the unified latent-frame formulation uniquely enables effective model-based planning.

    Authors: We acknowledge that an explicit ablation would help isolate the roles of the value and world-model components. Our current results demonstrate that joint refinement within the unified latent space yields measurable gains in planning performance. We will add a clarifying discussion in §5.1 on the interplay of these components and note the absence of a full isolation study as a limitation, while arguing that the single-stage unified formulation is what enables the observed iterative improvement. revision: partial
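
Response 2 commits to reporting trial counts, seed variance, and statistical comparisons for Table 2. One way such a comparison could look is a bootstrap confidence interval on the success-rate gap between two policies, computed from per-trial 0/1 outcomes; the sketch below uses placeholder data, not the paper's results.

```python
import numpy as np

def bootstrap_success_gap(trials_a, trials_b, n_boot: int = 10_000, seed: int = 0):
    """95% bootstrap confidence interval on the difference in success rate
    between two policies, given per-trial binary outcomes."""
    rng = np.random.default_rng(seed)
    a = np.asarray(trials_a, dtype=float)
    b = np.asarray(trials_b, dtype=float)
    gaps = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
            for _ in range(n_boot)]
    return np.percentile(gaps, [2.5, 97.5])

# Placeholder outcomes (50 trials per policy); NOT numbers from the paper.
policy_trials = np.r_[np.ones(40), np.zeros(10)]
baseline_trials = np.r_[np.ones(30), np.zeros(20)]
print(bootstrap_success_gap(policy_trials, baseline_trials))
```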

Circularity Check

0 steps flagged

No circularity: purely empirical fine-tuning with benchmark validation

full rationale

The paper presents an empirical adaptation method: single-stage fine-tuning of Cosmos-Predict2 on robot demonstrations to generate actions, future states, and values encoded as latent frames, with no architectural changes. All performance claims (98.5% on LIBERO, 67.1% on RoboCasa, real-world bimanual results) rest on direct experimental comparisons against baselines, not on any derivation, equation, or prediction that reduces to fitted parameters or self-citations within the paper. The pretrained model is treated as an external input; its latent space properties are not derived here. No load-bearing step invokes a uniqueness theorem, ansatz smuggling, or renaming of known results. This is a standard empirical robotics paper whose central claims are falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of video-model spatiotemporal priors to robot actions via latent-frame encoding and the sufficiency of single-stage fine-tuning on demonstration data; these are domain assumptions rather than derived results.

axioms (1)
  • domain assumption Pretrained video models capture complex physical interactions and scene evolution that transfer to robot visuomotor control when actions are encoded as latent frames.
    Invoked in the abstract as the reason a single stage of post-training suffices without architectural changes.

pith-pipeline@v0.9.0 · 5627 in / 1343 out tokens · 73784 ms · 2026-05-12T14:43:31.949228+00:00 · methodology


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  3. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  4. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  5. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  6. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  7. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  8. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  9. JailWAM: Jailbreaking World Action Models in Robot Control

    cs.RO 2026-04 unverdicted novelty 7.0

    JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

  10. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  11. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  12. Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DALI-R boosts 3D imitation learning success rates by 6.8% on average from suboptimal trajectories via latent imagination and reranking, with under 0.7x inference cost.

  13. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  14. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  15. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  16. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  17. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  18. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  19. XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios

    cs.RO 2026-04 unverdicted novelty 6.0

    XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.

  20. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  21. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  22. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  23. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  24. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  25. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  26. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  27. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  28. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  29. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  30. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  31. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  32. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  33. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 30 Pith papers · 22 internal anchors

  1. [1]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models.arXiv preprint arXiv:2405.04233, 2024

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  2. [2]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734,

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

  4. [4]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818,

  5. [5]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111,

  6. [6]

    Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist bimanual manipulation.arXiv preprint arXiv:2507.12898,

  7. [7]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603,

  8. [8]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193,

  9. [9]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

  10. [10]

    A dual process vla: Efficient robotic manipulation leveraging vlm.arXiv preprint arXiv:2410.15549,

    ByungOk Han, Jaehong Kim, and Jinhyeok Jang. A dual process vla: Efficient robotic manipulation leveraging vlm. arXiv preprint arXiv:2410.15549,

  11. [11]

    Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955,

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955,

  12. [12]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828,

  13. [13]

    Dita: Scaling diffusion transformer for generalist vision-language-action policy.arXiv preprint arXiv:2503.19757, 2025

    Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. arXiv preprint arXiv:2503.19757,

  14. [14]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803,

  15. [15]

    π0.5: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054,

  16. [16]

    A smooth sea never made a skilled sailor: Robust imitation via learning to search.arXiv preprint arXiv:2506.05294,

    Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy. A smooth sea never made a skilled sailor: Robust imitation via learning to search. arXiv preprint arXiv:2506.05294,

  17. [17]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705,

  18. [18]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Opti- mizing speed and success.arXiv preprint arXiv:2502.19645,

  20. [20]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603,

  21. [21]

    HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

    Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Youngyo Seo, and Jinwoo Shin. Hamlet: Switch your vision-language-action model into a history-aware policy. arXiv preprint arXiv:2510.00695,

  22. [22]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025

  23. [23]

    Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635,

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635,

  24. [24]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations.arXiv preprint arXiv:2310.17596,

  25. [25]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523,

  26. [26]

    Cosmos World Foundation Model Platform for Physical AI

    Cosmos World Foundation Model Platform for Physical AI. URL https://arxiv.org/abs/2501.03575

  27. [27]

    Dyna, an integrated architecture for learning, planning, and reacting

    Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, ISSN 0163-5719. doi: 10.1145/122344.122377. URL https://doi.org/10.1145/122344.122377

  28. [28]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  29. [29]

    Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340, 2025

    Yiqi Wang, Mrinal Verghese, and Jeff Schneider. Latent policy steering with embodiment-agnostic pretrained world models.arXiv preprint arXiv:2507.13340,

  30. [30]

    Dual-stream diffusion for world-model augmented vision-language-action model, 2025

    John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607,

  31. [31]

    Roboenvision: A long-horizon video generation model for multi-task robot manipulation.arXiv preprint arXiv:2506.22007, 2025

    Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, and Abhinav Valada. Roboenvision: A long-horizon video generation model for multi-task robot manipulation.arXiv preprint arXiv:2506.22007,

  32. [32]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,

  33. [33]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705,

  34. [34]

    Flare: Robot learning with implicit world modeling, 2025

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659,

  35. [35]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404,

  36. [36]

    Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025

    Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, and Haoang Li. Flowvla: Thinking in motion with a visual chain of thought.arXiv preprint arXiv:2508.18269,

  37. [37]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792,

