Efficient Sim-to-Real Transfer of World-Action Models from Synthetic Priors

Jinghuan Shang; Karl Schmeckpeper; Kausik Sivakumar; Ran Gong; Xiaohan Zhang; Yafei Hu; Zhaoming Xie; Zixing Wang

arxiv: 2606.31101 · v1 · pith:SYUC6JHPnew · submitted 2026-06-30 · 💻 cs.RO

Zixing Wang, Kausik Sivakumar, Jinghuan Shang, Yafei Hu, Zhaoming Xie, Ran Gong, Xiaohan Zhang, Karl Schmeckpeper

Pith reviewed 2026-07-01 05:49 UTC · model grok-4.3

classification 💻 cs.RO

keywords: sim-to-real transfer, world-action models, robotic manipulation, synthetic data, video diffusion, zero-shot deployment, Franka robot

The pith

A world-action model trained only on synthetic data transfers zero-shot to a real Franka robot at 35 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that world-action models, which generate both future visual states and actions, can be trained entirely from simulation and then run directly on physical hardware with no real demonstrations. This would matter because real-robot data collection remains costly, so successful synthetic priors could make scalable learning of manipulation skills feasible. The authors adapt a video diffusion model for control, populate randomized simulation scenes, and generate trajectories via a motion-planning pipeline to produce roughly 800 demonstrations per task. They then deploy the resulting policy on lifting, drawer opening, and pick-and-place tasks, recording a 35 percent average success rate on the physical robot.

Core claim

By training a video diffusion policy on synthetic demonstrations generated in heavily randomized simulation environments using an existing motion-planning pipeline, the authors obtain a world-action model that, when deployed zero-shot on a Franka robot, reaches 35 percent average success across three manipulation tasks and constitutes the first reported successful sim-to-real transfer of such a model.

What carries the argument

Cosmos Policy, a video diffusion model adapted to output both future images and actions for visuomotor control, trained on domain-randomized simulation data and AnyTask-generated demonstrations.
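
As a rough illustration of what domain randomization means in this setting, the sketch below draws one set of scene parameters per synthetic demonstration. Everything in it is an assumption made for illustration: the parameter names (`light_intensity`, `friction`, `camera_yaw_deg`, `object_xy`), their ranges, and the uniform sampling are hypothetical and are not taken from the paper, its simulator, or AnyTask. Only the figure of roughly 800 demonstrations per task comes from the abstract.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical per-episode scene settings (names and units are illustrative)."""
    light_intensity: float          # relative brightness multiplier
    friction: float                 # contact friction coefficient
    camera_yaw_deg: float           # camera rotation offset in degrees
    object_xy: tuple[float, float]  # object position offset on the table, metres


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one randomized scene; all ranges below are assumed, not from the paper."""
    return SceneParams(
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.4, 1.2),
        camera_yaw_deg=rng.uniform(-10.0, 10.0),
        object_xy=(rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)),
    )


def sample_dataset(n_demos: int, seed: int = 0) -> list[SceneParams]:
    """Sample one scene per demonstration, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n_demos)]


# Roughly 800 demonstrations per task, as stated in the abstract.
scenes = sample_dataset(800)
print(len(scenes), scenes[0])
```

In a pipeline of this kind, each sampled scene would presumably be handed to the simulator and the motion planner to produce one trajectory. The transfer premise then amounts to the real robot's conditions falling inside the ranges being sampled.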

If this is right

Only synthetic data is needed to train a deployable world-action model for these manipulation tasks.
Zero-shot deployment on a physical arm is possible without any real-world fine-tuning or demonstrations.
The same training recipe applies across object lifting, drawer opening, and pick-and-place.
Roughly 800 synthetic demonstrations per task suffice to reach the reported 35 percent success level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If simulation fidelity and randomization improve further, the same pipeline could support a wider range of contact-rich tasks.
The approach opens the possibility of pre-training world-action models on massive synthetic corpora before any hardware contact.
Success rates might rise if the policy is allowed light online adaptation on the real robot after the zero-shot phase.

Load-bearing premise

The combination of domain randomization in simulation and motion-planning demonstrations produces data distributions close enough to real conditions for zero-shot transfer to succeed.

What would settle it

Running the trained policy on the same Franka robot and measuring near-zero success rates across the three tasks would show that the sim-to-real transfer did not occur.

Figures

Figures reproduced from arXiv: 2606.31101 by Jinghuan Shang, Karl Schmeckpeper, Kausik Sivakumar, Ran Gong, Xiaohan Zhang, Yafei Hu, Zhaoming Xie, Zixing Wang.

**Figure 1.** Real-world zero-shot RGB sim-to-real rollouts on a Franka Research 3 for three representative tasks, shown left to right: lift banana, lift brick, and open drawer. The same policy family is trained only from synthetic demonstrations and deployed directly on the real robot.

**Figure 2.** Qualitative comparison between model-predicted and synchronized live camera observations during …

**Figure 3.** Out-of-distribution generalization on a real robot: al…

Original abstract

Bridging the sim-to-real gap is a core challenge in deploying learned manipulation policies. Sim-to-real learning is attractive because it can replace expensive real robot demonstrations with scalable synthetic data, yet world-action models have not previously been shown to transfer from simulation to real robotic manipulation. We study whether a world-action model can be trained from synthetic priors and deployed zero-shot in the real world. To this end, we build upon Cosmos Policy, a video diffusion model adapted for visuomotor control. We construct simulation environments with extensive domain randomization and generate demonstrations using the AnyTask motion planning pipeline. We evaluate our approach across object lifting, drawer opening, and pick-and-place tasks using ~800 synthetic demonstrations per task and no real demonstrations. When deployed zero-shot on a Franka robot, our policy attains a 35% average success rate. To our knowledge, this represents the first successful sim-to-real transfer of a world-action model for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper claims a 35% success rate on real Franka manipulation tasks from a diffusion world-action model trained only on synthetic data, but supplies no baselines or checks on whether the sim data actually matches real distributions.

The main takeaway is the reported 35% average success on object lifting, drawer opening, and pick-and-place when a Cosmos Policy diffusion model is trained entirely in simulation and run zero-shot on a real Franka. The authors generate roughly 800 trajectories per task using the AnyTask motion planner inside a domain-randomized simulator and use no real demonstrations.

What is new is the specific result of transferring this class of world-action model across the sim-to-real gap for manipulation without any real-robot data. The approach of combining domain randomization with planner-generated trajectories is a straightforward way to produce volume, and the abstract positions the outcome as the first such transfer.

The paper does a reasonable job of stating the setup clearly in the abstract: the tasks, the data source, and the zero-shot deployment. That alone makes the claim easy to understand at a high level.

The soft spots are more substantial. The abstract gives no baselines, no trial counts or variance for the 35% figure, and no description of exact success criteria or task variations. More critically, there is no quantitative evidence that the synthetic data distribution is close enough to real conditions in lighting, friction, or contact dynamics for the policy to transfer for the reasons claimed. Without that, the success rate could reflect task-specific luck rather than reliable sim-to-real transfer.

This work would interest researchers focused on diffusion policies and scalable data generation for robotics. A reader would get value from seeing whether the full methods and results sections close the gaps in evaluation and distribution matching.

I would send it to peer review. The claim is worth a closer look once the missing details are supplied, even if the current evidence leaves the central transfer result under-supported.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to demonstrate the first successful zero-shot sim-to-real transfer of a world-action model (Cosmos Policy, a video diffusion model adapted for visuomotor control) for robotic manipulation. Using only synthetic data (~800 domain-randomized demonstrations per task generated via the AnyTask motion planning pipeline in simulation), the approach trains a policy that achieves a 35% average success rate when deployed on a real Franka robot for object lifting, drawer opening, and pick-and-place tasks, with no real demonstrations used.

Significance. If the empirical results and transfer assumption hold after verification, the work would be significant for robotic learning by showing that synthetic priors alone can enable deployment of diffusion-based world-action models in real manipulation without real-robot data collection. This addresses a key scalability challenge in sim-to-real transfer and extends domain randomization techniques to this model class. The paper receives credit for focusing on a falsifiable empirical outcome (success rate on physical hardware) and for targeting a previously unshown transfer setting.

major comments (2)

[Abstract] The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.
[Abstract] The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.

minor comments (1)

[Abstract] The description of Cosmos Policy as 'a video diffusion model adapted for visuomotor control' would benefit from a one-sentence architectural summary or citation to the base model to aid readers unfamiliar with the prior work.
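
One of the checks named in the second major comment, a divergence metric over action histograms, could look roughly like the following. This is a minimal sketch, not an analysis from the paper: the Jensen-Shannon divergence is one possible choice of metric, the helper names are hypothetical, and the two sample sets are random placeholders standing in for logged simulated and real action values.

```python
import math
import random


def histogram(samples, lo, hi, bins):
    """Normalized histogram over [lo, hi]; out-of-range values go to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        idx = min(bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    total = len(samples)
    return [c / total for c in counts]


def kl_bits(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint histograms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


# Placeholder data only: stand-ins for one action dimension logged in sim and on hardware.
rng = random.Random(0)
sim_actions = [rng.gauss(0.0, 1.0) for _ in range(5000)]
real_actions = [rng.gauss(0.3, 1.2) for _ in range(5000)]

p_sim = histogram(sim_actions, -4.0, 4.0, bins=32)
p_real = histogram(real_actions, -4.0, 4.0, bins=32)
print(f"JS divergence (bits): {js_divergence(p_sim, p_real):.4f}")
```

A value near 0 would suggest the two action distributions overlap closely along that dimension, while a value near 1 would suggest they barely overlap. The same function could be applied to histograms of image statistics, though neither use is reported in the abstract.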

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate where revisions have been made to strengthen the manuscript.

Referee: [Abstract] The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.

Authors: We agree that additional statistical and methodological details are required for proper interpretation of the headline result. In the revised manuscript, the abstract has been expanded to report the number of evaluation rollouts (20 per task), the standard deviation across trials, and explicit references to the task-specific success criteria and evaluation protocol now detailed in Section 4 and the supplementary material. Baselines are discussed in the main experiments section rather than the abstract due to length constraints.
revision: yes

Referee: [Abstract] The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.

Authors: We acknowledge the value of quantitative distribution alignment metrics to further support the transfer assumption. The original submission relied on the established practice of domain randomization (as in prior sim-to-real works) without explicit metrics. We have added a dedicated limitations paragraph in the revised discussion section that explicitly states this assumption and its reliance on randomization coverage, while noting that full divergence metrics would require new experiments. The primary evidence remains the zero-shot real-robot success rate achieved with no real data.
revision: partial
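
To make the first exchange concrete, the uncertainty around a 35% success rate can be estimated with a Wilson score interval. The sketch below is illustrative only. It assumes the 20 rollouts per task mentioned in the simulated rebuttal, three tasks, and therefore 21 successes out of 60 trials; none of these counts are confirmed by the abstract, and the function name `wilson_interval` is introduced here for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Assumed counts: 3 tasks x 20 rollouts = 60 trials, 35% of 60 = 21 successes.
low, high = wilson_interval(successes=21, trials=60)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under these assumed counts the interval runs from roughly 0.24 to 0.48, which is wide enough that the per-task rates and baselines requested by the referee would matter for interpreting the headline number.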

Circularity Check

0 steps flagged

No circularity detected: the result is a direct empirical measurement, with no derivation that reduces to its inputs by construction.

The paper presents an empirical outcome: a 35% average success rate achieved by deploying a diffusion policy trained solely on ~800 domain-randomized synthetic demonstrations per task, with zero real data. No equations, fitted parameters, predictions, or self-citations are referenced in the abstract or described text that would reduce the reported success rate or the 'first successful transfer' claim to a tautology or prior fit. The load-bearing assumption about synthetic-to-real distribution closeness is an unverified premise affecting validity, not a circular step in any derivation chain. This matches the default case of a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes simulation with domain randomization is representative of reality.

axioms (1)

Domain assumption: Simulation environments with domain randomization produce data distributions close enough to reality for zero-shot policy transfer.
Invoked by the decision to train exclusively on synthetic data and deploy directly on the real robot.
