Efficient Sim-to-Real Transfer of World-Action Models from Synthetic Priors
Pith reviewed 2026-07-01 05:49 UTC · model grok-4.3
The pith
A world-action model trained only on synthetic data transfers zero-shot to a real Franka robot at 35 percent success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a video diffusion policy on synthetic demonstrations generated in heavily randomized simulation environments using an existing motion-planning pipeline, the authors obtain a world-action model that, when deployed zero-shot on a Franka robot, reaches 35 percent average success across three manipulation tasks and constitutes the first reported successful sim-to-real transfer of such a model.
What carries the argument
Cosmos Policy, a video diffusion model adapted to output both future images and actions for visuomotor control, trained on domain-randomized simulation data and AnyTask-generated demonstrations.
If this is right
- Only synthetic data is needed to train a deployable world-action model for these manipulation tasks.
- Zero-shot deployment on a physical arm is possible without any real-world fine-tuning or demonstrations.
- The same training recipe applies across object lifting, drawer opening, and pick-and-place.
- Roughly 800 synthetic demonstrations per task suffice to reach the reported 35 percent success level.
Where Pith is reading between the lines
- If simulation fidelity and randomization improve further, the same pipeline could support a wider range of contact-rich tasks.
- The approach opens the possibility of pre-training world-action models on massive synthetic corpora before any hardware contact.
- Success rates might rise if the policy is allowed light online adaptation on the real robot after the zero-shot phase.
Load-bearing premise
The combination of domain randomization in simulation and motion-planning demonstrations produces data distributions close enough to real conditions for zero-shot transfer to succeed.
What would settle it
Running the trained policy on the same Franka robot and measuring near-zero success rates across the three tasks would show that the sim-to-real transfer did not occur.
Figures
read the original abstract
Bridging the sim-to-real gap is a core challenge in deploying learned manipulation policies. Sim-to-real learning is attractive because it can replace expensive real robot demonstrations with scalable synthetic data, yet world-action models have not previously been shown to transfer from simulation to real robotic manipulation. We study whether a world-action model can be trained from synthetic priors and deployed zero-shot in the real world. To this end, we build upon Cosmos Policy, a video diffusion model adapted for visuomotor control. We construct simulation environments with extensive domain randomization and generate demonstrations using the AnyTask motion planning pipeline. We evaluate our approach across object lifting, drawer opening, and pick-and-place tasks using ${\sim}800$ synthetic demonstrations per task and no real demonstrations. When deployed zero-shot on a Franka Robot, our policy attains a 35\% average success rate. To our knowledge, this represents the first successful sim-to-real transfer of a world-action model for robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to demonstrate the first successful zero-shot sim-to-real transfer of a world-action model (Cosmos Policy, a video diffusion model adapted for visuomotor control) for robotic manipulation. Using only synthetic data (~800 domain-randomized demonstrations per task generated via the AnyTask motion planning pipeline in simulation), the approach trains a policy that achieves a 35% average success rate when deployed on a real Franka robot for object lifting, drawer opening, and pick-and-place tasks, with no real demonstrations used.
Significance. If the empirical results and transfer assumption hold after verification, the work would be significant for robotic learning by showing that synthetic priors alone can enable deployment of diffusion-based world-action models in real manipulation without real-robot data collection. This addresses a key scalability challenge in sim-to-real transfer and extends domain randomization techniques to this model class. The paper receives credit for focusing on a falsifiable empirical outcome (success rate on physical hardware) and for targeting a previously unshown transfer setting.
major comments (2)
- [Abstract] Abstract: The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.
- [Abstract] Abstract: The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.
minor comments (1)
- [Abstract] Abstract: The description of Cosmos Policy as 'a video diffusion model adapted for visuomotor control' would benefit from a one-sentence architectural summary or citation to the base model to aid readers unfamiliar with the prior work.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate where revisions have been made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline result of a 35% average success rate is presented without baselines, variance or standard deviation across trials, number of evaluation rollouts, task-specific success criteria, or evaluation protocol details. These omissions make it impossible to determine whether the rate reflects meaningful transfer or could be explained by task simplicity or chance.
Authors: We agree that additional statistical and methodological details are required for proper interpretation of the headline result. In the revised manuscript, the abstract has been expanded to report the number of evaluation rollouts (20 per task), the standard deviation across trials, and explicit references to the task-specific success criteria and evaluation protocol now detailed in Section 4 and the supplementary material. Baselines are discussed in the main experiments section rather than the abstract due to length constraints. revision: yes
-
Referee: [Abstract] Abstract: The zero-shot sim-to-real claim is load-bearing on the assumption that the ~800 AnyTask-generated, domain-randomized trajectories lie inside the support of the real Franka data distribution for the Cosmos Policy. No quantitative support (image statistics, action histograms, friction/contact dynamics mismatch, or distribution divergence metrics) is referenced, leaving the attribution of the 35% rate to successful world-action model transfer unverified.
Authors: We acknowledge the value of quantitative distribution alignment metrics to further support the transfer assumption. The original submission relied on the established practice of domain randomization (as in prior sim-to-real works) without explicit metrics. We have added a dedicated limitations paragraph in the revised discussion section that explicitly states this assumption and its reliance on randomization coverage, while noting that full divergence metrics would require new experiments. The primary evidence remains the zero-shot real-robot success rate achieved with no real data. revision: partial
Circularity Check
No circularity detected; result is direct empirical measurement with no derivation reducing to inputs by construction
full rationale
The paper presents an empirical outcome: a 35% average success rate achieved by deploying a diffusion policy trained solely on ~800 domain-randomized synthetic demonstrations per task, with zero real data. No equations, fitted parameters, predictions, or self-citations are referenced in the abstract or described text that would reduce the reported success rate or the 'first successful transfer' claim to a tautology or prior fit. The load-bearing assumption about synthetic-to-real distribution closeness is an unverified premise affecting validity, not a circular step in any derivation chain. This matches the default case of a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Simulation environments with domain randomization produce data distributions close enough to reality for zero-shot policy transfer.
Reference graph
Works this paper leans on
-
[1]
International Conference on Learning Representations , year=
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning , author=. International Conference on Learning Representations , year=
-
[2]
arXiv preprint arXiv:2512.17853 , year=
AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning , author=. arXiv preprint arXiv:2512.17853 , year=
-
[3]
Cosmos World Foundation Model Platform for Physical
NVIDIA and Agarwal, Niket and Ali, Ahmed and others , journal=. Cosmos World Foundation Model Platform for Physical
-
[4]
Proceedings of Robotics: Science and Systems (RSS) , year=
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author=. Proceedings of Robotics: Science and Systems (RSS) , year=
-
[5]
Kim, Moo Jin and Pertsch, Karl and Karamcheti, Siddharth and Xiao, Ted and Balakrishna, Ashwin and Nair, Suraj and Rafailov, Rafael and Foster, Ethan and Lam, Grace and Sanketi, Pannag and Vuong, Quan and Kollar, Thomas and Burchfiel, Benjamin and Tedrake, Russ and Sadigh, Dorsa and Levine, Sergey and Liang, Percy and Finn, Chelsea , journal=. Open
-
[6]
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , year=
Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World , author=. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , year=
-
[7]
Solving Rubik's Cube with a Robot Hand
Solving. arXiv preprint arXiv:1910.07113 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[8]
World Models , author=. arXiv preprint arXiv:1803.10122 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Advances in Neural Information Processing Systems , volume=
Learning Universal Policies via Text-Guided Video Generation , author=. Advances in Neural Information Processing Systems , volume=
-
[10]
Brohan, Anthony and Brown, Noah and Carbajal, Justice and Chebotar, Yevgen and Chen, Xi and Choromanski, Krzysztof and Ding, Tianli and Driess, Danny and Dubey, Avinava and Finn, Chelsea and others , booktitle=
-
[11]
Makoviychuk, Viktor and Wawrzyniak, Lukasz and Guo, Yunrong and Lu, Michelle and Storey, Kier and Macklin, Miles and Hoeller, David and Rudin, Nikita and Allshire, Arthur and Handa, Ankur and State, Gavriel , journal=
-
[12]
arXiv preprint arXiv:2009.13303 , year=
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey , author=. arXiv preprint arXiv:2009.13303 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.