pith. sign in

arxiv: 2606.31101 · v1 · pith:SYUC6JHPnew · submitted 2026-06-30 · 💻 cs.RO

Efficient Sim-to-Real Transfer of World-Action Models from Synthetic Priors

Pith reviewed 2026-07-01 05:49 UTC · model grok-4.3

classification 💻 cs.RO
keywords sim-to-real transferworld-action modelsrobotic manipulationsynthetic datavideo diffusionzero-shot deploymentFranka robot
0
0 comments X

The pith

A world-action model trained only on synthetic data transfers zero-shot to a real Franka robot at 35 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that world-action models, which generate both future visual states and actions, can be trained entirely from simulation and then run directly on physical hardware with no real demonstrations. This would matter because real-robot data collection remains costly, so successful synthetic priors could make scalable learning of manipulation skills feasible. The authors adapt a video diffusion model for control, populate randomized simulation scenes, and generate trajectories via a motion-planning pipeline to produce roughly 800 demonstrations per task. They then deploy the resulting policy on lifting, drawer opening, and pick-and-place tasks, recording a 35 percent average success rate on the physical robot.

Core claim

By training a video diffusion policy on synthetic demonstrations generated in heavily randomized simulation environments using an existing motion-planning pipeline, the authors obtain a world-action model that, when deployed zero-shot on a Franka robot, reaches 35 percent average success across three manipulation tasks and constitutes the first reported successful sim-to-real transfer of such a model.

What carries the argument

Cosmos Policy, a video diffusion model adapted to output both future images and actions for visuomotor control, trained on domain-randomized simulation data and AnyTask-generated demonstrations.

If this is right

  • Only synthetic data is needed to train a deployable world-action model for these manipulation tasks.
  • Zero-shot deployment on a physical arm is possible without any real-world fine-tuning or demonstrations.
  • The same training recipe applies across object lifting, drawer opening, and pick-and-place.
  • Roughly 800 synthetic demonstrations per task suffice to reach the reported 35 percent success level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If simulation fidelity and randomization improve further, the same pipeline could support a wider range of contact-rich tasks.
  • The approach opens the possibility of pre-training world-action models on massive synthetic corpora before any hardware contact.
  • Success rates might rise if the policy is allowed light online adaptation on the real robot after the zero-shot phase.

Load-bearing premise

The combination of domain randomization in simulation and motion-planning demonstrations produces data distributions close enough to real conditions for zero-shot transfer to succeed.

What would settle it

Running the trained policy on the same Franka robot and measuring near-zero success rates across the three tasks would show that the sim-to-real transfer did not occur.

Figures

Figures reproduced from arXiv: 2606.31101 by Jinghuan Shang, Karl Schmeckpeper, Kausik Sivakumar, Ran Gong, Xiaohan Zhang, Yafei Hu, Zhaoming Xie, Zixing Wang.

Figure 1
Figure 1. Figure 1: Real-world zero-shot RGB sim-to-real rollouts on a Franka Research 3 for three representative tasks, shown left to right: lift banana, lift brick, and open drawer. The same policy family is trained only from synthetic demonstrations and deployed directly on the real robot. Abstract Bridging the sim-to-real gap is a core challenge in deploy￾ing learned manipulation policies. Sim-to-real learning is attracti… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison between model-predicted and synchronized live camera observations during [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Out-of-distribution generalization on a real robot: al [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
read the original abstract

Bridging the sim-to-real gap is a core challenge in deploying learned manipulation policies. Sim-to-real learning is attractive because it can replace expensive real robot demonstrations with scalable synthetic data, yet world-action models have not previously been shown to transfer from simulation to real robotic manipulation. We study whether a world-action model can be trained from synthetic priors and deployed zero-shot in the real world. To this end, we build upon Cosmos Policy, a video diffusion model adapted for visuomotor control. We construct simulation environments with extensive domain randomization and generate demonstrations using the AnyTask motion planning pipeline. We evaluate our approach across object lifting, drawer opening, and pick-and-place tasks using ${\sim}800$ synthetic demonstrations per task and no real demonstrations. When deployed zero-shot on a Franka Robot, our policy attains a 35\% average success rate. To our knowledge, this represents the first successful sim-to-real transfer of a world-action model for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to demonstrate the first successful zero-shot sim-to-real transfer of a world-action model (Cosmos Policy, a video diffusion model adapted for visuomotor control) for robotic manipulation. Using only synthetic data (~800 domain-randomized demonstrations per task generated via the AnyTask motion planning pipeline in simulation), the approach trains a policy that achieves a 35% average success rate when deployed on a real Franka robot for object lifting, drawer opening, and pick-and-place tasks, with no real demonstrations used.

Significance. If the empirical results and transfer assumption hold after verification, the work would be significant for robotic learning by showing that synthetic priors alone can enable deployment of diffusion-based world-action models in real manipulation without real-robot data collection. This addresses a key scalability challenge in sim-to-real transfer and extends domain randomization techniques to this model class. The paper receives credit for focusing on a falsifiable empirical outcome (success rate on physical hardware) and for targeting a previously unshown transfer setting.

major comments (2)
  1. [Abstract] Abstract: The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.
  2. [Abstract] Abstract: The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.
minor comments (1)
  1. [Abstract] Abstract: The description of Cosmos Policy as 'a video diffusion model adapted for visuomotor control' would benefit from a one-sentence architectural summary or citation to the base model to aid readers unfamiliar with the prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate where revisions have been made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.

    Authors: We agree that additional statistical and methodological details are required for proper interpretation of the headline result. In the revised manuscript, the abstract has been expanded to report the number of evaluation rollouts (20 per task), the standard deviation across trials, and explicit references to the task-specific success criteria and evaluation protocol now detailed in Section 4 and the supplementary material. Baselines are discussed in the main experiments section rather than the abstract due to length constraints. revision: yes

  2. Referee: [Abstract] Abstract: The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.

    Authors: We acknowledge the value of quantitative distribution alignment metrics to further support the transfer assumption. The original submission relied on the established practice of domain randomization (as in prior sim-to-real works) without explicit metrics. We have added a dedicated limitations paragraph in the revised discussion section that explicitly states this assumption and its reliance on randomization coverage, while noting that full divergence metrics would require new experiments. The primary evidence remains the zero-shot real-robot success rate achieved with no real data. revision: partial

Circularity Check

0 steps flagged

No circularity detected; result is direct empirical measurement with no derivation reducing to inputs by construction

full rationale

The paper presents an empirical outcome: a 35% average success rate achieved by deploying a diffusion policy trained solely on ~800 domain-randomized synthetic demonstrations per task, with zero real data. No equations, fitted parameters, predictions, or self-citations are referenced in the abstract or described text that would reduce the reported success rate or the 'first successful transfer' claim to a tautology or prior fit. The load-bearing assumption about synthetic-to-real distribution closeness is an unverified premise affecting validity, not a circular step in any derivation chain. This matches the default case of a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes simulation with domain randomization is representative of reality.

axioms (1)
  • domain assumption Simulation environments with domain randomization produce data distributions close enough to reality for zero-shot policy transfer.
    Invoked by the decision to train exclusively on synthetic data and deploy directly on the real robot.

pith-pipeline@v0.9.1-grok · 5720 in / 1239 out tokens · 29739 ms · 2026-07-01T05:49:18.770921+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    International Conference on Learning Representations , year=

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning , author=. International Conference on Learning Representations , year=

  2. [2]

    arXiv preprint arXiv:2512.17853 , year=

    AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning , author=. arXiv preprint arXiv:2512.17853 , year=

  3. [3]

    Cosmos World Foundation Model Platform for Physical

    NVIDIA and Agarwal, Niket and Ali, Ahmed and others , journal=. Cosmos World Foundation Model Platform for Physical

  4. [4]

    Proceedings of Robotics: Science and Systems (RSS) , year=

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author=. Proceedings of Robotics: Science and Systems (RSS) , year=

  5. [5]

    Kim, Moo Jin and Pertsch, Karl and Karamcheti, Siddharth and Xiao, Ted and Balakrishna, Ashwin and Nair, Suraj and Rafailov, Rafael and Foster, Ethan and Lam, Grace and Sanketi, Pannag and Vuong, Quan and Kollar, Thomas and Burchfiel, Benjamin and Tedrake, Russ and Sadigh, Dorsa and Levine, Sergey and Liang, Percy and Finn, Chelsea , journal=. Open

  6. [6]

    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , year=

    Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World , author=. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , year=

  7. [7]

    Solving Rubik's Cube with a Robot Hand

    Solving. arXiv preprint arXiv:1910.07113 , year=

  8. [8]

    World Models

    World Models , author=. arXiv preprint arXiv:1803.10122 , year=

  9. [9]

    Advances in Neural Information Processing Systems , volume=

    Learning Universal Policies via Text-Guided Video Generation , author=. Advances in Neural Information Processing Systems , volume=

  10. [10]

    Brohan, Anthony and Brown, Noah and Carbajal, Justice and Chebotar, Yevgen and Chen, Xi and Choromanski, Krzysztof and Ding, Tianli and Driess, Danny and Dubey, Avinava and Finn, Chelsea and others , booktitle=

  11. [11]

    Makoviychuk, Viktor and Wawrzyniak, Lukasz and Guo, Yunrong and Lu, Michelle and Storey, Kier and Macklin, Miles and Hoeller, David and Rudin, Nikita and Allshire, Arthur and Handa, Ankur and State, Gavriel , journal=

  12. [12]

    arXiv preprint arXiv:2009.13303 , year=

    Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey , author=. arXiv preprint arXiv:2009.13303 , year=