ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Pith reviewed 2026-05-10 16:07 UTC · model grok-4.3
The pith
Compositional Simulation generates large-scale realistic robot training data from limited real examples by combining classical simulation with a neural video transformer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ComSim combines classical simulation, which generates diverse action sequences, with a neural simulator that converts the resulting simulation videos into real-world visual representations. A closed-loop real-sim-real augmentation pipeline starts from a small real dataset, produces large quantities of consistent action-video pairs, and feeds them into policy training, yielding higher success rates for models operating in actual robot environments.
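Read as data flow, the core claim implies a loop of roughly the following shape. The sketch below is structural only: the paper's abstract names no interfaces, so every callable here (simulation sampling, renderer training, policy training) is a hypothetical stand-in supplied by the caller, not the authors' actual API.

```python
def comsim_augment(real_dataset, sample_sim, train_renderer, train_policy,
                   n_rounds=1, n_samples=1000):
    """Structural sketch of the closed-loop real-sim-real pipeline.

    real_dataset:   small list of (actions, real_video) pairs
    sample_sim:     () -> (actions, sim_video), from classical simulation
    train_renderer: dataset -> callable mapping sim_video -> realistic video
    train_policy:   dataset -> policy
    """
    # Fit the neural simulator on the limited real data so it can map
    # classical-simulation appearance to real-world appearance.
    renderer = train_renderer(real_dataset)

    augmented, policy = list(real_dataset), None
    for _ in range(n_rounds):
        for _ in range(n_samples):
            # Classical simulation supplies a diverse, physically grounded
            # action sequence; the renderer supplies realistic visuals.
            actions, sim_video = sample_sim()
            augmented.append((actions, renderer(sim_video)))
        # Closed loop: the grown action-video dataset feeds policy training.
        policy = train_policy(augmented)
    return augmented, policy
```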
What carries the argument
The closed-loop real-sim-real data augmentation pipeline in which a neural simulator learns to render classical simulation videos as realistic footage while preserving action details.
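The abstract commits to neither an architecture nor a training objective for that renderer. As one concrete possibility only, a conditional flow-matching step, a common recipe for video-to-video translation, could look like the sketch below; the `model` signature and the availability of paired sim/real clips are both assumptions.

```python
import torch
import torch.nn.functional as F

def renderer_training_step(model, optimizer, sim_video, real_video):
    """One hypothetical training step for a sim-to-real video renderer.

    sim_video, real_video: paired (B, T, C, H, W) tensors with identical
    motion but simulated vs. real appearance. `model(noisy, cond, t)` is an
    assumed signature, e.g. a video transformer conditioned on sim frames.
    """
    b = real_video.shape[0]
    noise = torch.randn_like(real_video)
    t = torch.rand(b, device=real_video.device).view(b, 1, 1, 1, 1)

    # Interpolate between the real clip (t=0) and pure noise (t=1); the
    # model predicts the velocity (noise - data) given the sim-video
    # conditioning, which carries the action/motion information.
    noisy = (1 - t) * real_video + t * noise
    pred = model(noisy, sim_video, t.flatten())
    loss = F.mse_loss(pred, noise - real_video)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```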
If this is right
- Policies trained on the generated data achieve higher success rates when deployed on physical robots because the domain gap is reduced.
- The method produces training sets that cover more environmental variation than could be captured directly with a comparable amount of real-world collection effort.
- Data volume can grow through simulation without requiring matching increases in physical data collection time or cost.
- The same pipeline supports training of more capable robot policies for complex tasks that need broad scenario coverage.
Where Pith is reading between the lines
- The approach could extend to generating training data for multi-step manipulation tasks where scenario diversity is especially hard to capture in reality.
- If the neural conversion step generalizes across robot platforms, it might lower the amount of real data needed when switching to new hardware.
- The pipeline offers a route to improve world models for robotics by supplying them with larger volumes of consistent real-looking video.
Load-bearing premise
The neural simulator can convert classical simulation videos into real-world appearances without introducing visual artifacts or distorting the motion information required for policy learning.
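That premise is checkable before any policy is trained. One plausible fidelity probe, not specified by the paper, is to compare dense optical flow between each classical-simulation clip and its neural re-rendering: if motion is preserved, the flow fields should nearly coincide even when appearance differs.

```python
import numpy as np
import cv2  # OpenCV, for dense Farneback optical flow

def flow_consistency(sim_frames, rendered_frames):
    """Mean end-point error between flows of a sim clip and its rendering.

    Inputs are equal-length lists of grayscale uint8 frames of the same
    size. A low score suggests the renderer changed appearance without
    distorting motion; this metric is an assumption, not from the paper.
    """
    def flows(frames):
        return [cv2.calcOpticalFlowFarneback(a, b, None,
                                             0.5, 3, 15, 3, 5, 1.2, 0)
                for a, b in zip(frames[:-1], frames[1:])]

    errors = [np.linalg.norm(f_sim - f_ren, axis=-1).mean()
              for f_sim, f_ren in zip(flows(sim_frames),
                                      flows(rendered_frames))]
    return float(np.mean(errors))
```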
What would settle it
Real-world robot experiments comparing policies trained on the generated datasets against policies trained only on classical simulation or on the original small real dataset. No improvement in task success rates would refute the core claim; a consistent gain would support it.
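Because robot success rates come from small numbers of rollouts, any such comparison needs interval estimates rather than raw percentages. A minimal sketch follows; the rollout counts are purely illustrative placeholders, not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a success rate over `trials` rollouts."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Placeholder counts only; the decisive question is whether the ComSim
# interval separates from both baselines on a real robot.
for name, (s, n) in {"real data only": (11, 50),
                     "classical sim only": (14, 50),
                     "ComSim-generated": (23, 50)}.items():
    lo, hi = wilson_interval(s, n)
    print(f"{name}: {s}/{n} successes, 95% CI [{lo:.2f}, {hi:.2f}]")
```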
Original abstract
Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ComSim, a hybrid compositional simulation method that combines classical simulation with a neural simulator trained on a small amount of real-world data. It employs a closed-loop real-sim-real data augmentation pipeline to generate large-scale, diverse action-video pairs that aim to cover broader real-world scenarios, claiming this reduces the sim2real domain gap and yields higher success rates for real-world robot policy training based on extensive experiments.
Significance. If the neural simulator accurately transforms simulation videos while preserving action trajectories, physics, and semantics without artifacts, the approach could provide a practical, scalable route to augment limited real robot data and improve sim2real transfer. The closed-loop pipeline is a constructive element that could support iterative refinement.
major comments (2)
- [Abstract] The central claims that the method 'significantly reduces the sim2real domain gap' and produces 'higher success rates in real-world policy model training' are asserted without any quantitative results, baselines, statistical tests, or error analysis, leaving the empirical contribution unsupported.
- [Method and Experiments] No implementation details, neural simulator architecture, training procedure, or fidelity metrics (e.g., action reconstruction error, optical-flow consistency, or policy-ablation deltas) are supplied to verify that the transformation preserves the underlying action trajectories and dynamics, which is load-bearing for the claim that the generated data improves rather than degrades downstream policies.
minor comments (1)
- [Abstract] The term 'Compositional Simulation' is used throughout but its precise compositional structure (how classical and neural components are combined at the data-generation level) is not formally defined or contrasted with prior hybrid simulation work.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We appreciate the constructive feedback and will revise the manuscript to address the concerns raised regarding the abstract and the method/experiments sections.
Point-by-point responses
- Referee: [Abstract] The central claims that the method 'significantly reduces the sim2real domain gap' and produces 'higher success rates in real-world policy model training' are asserted without any quantitative results, baselines, statistical tests, or error analysis, leaving the empirical contribution unsupported.
  Authors: We agree with this observation. While the experiments section contains quantitative results supporting these claims, the abstract does not include specific numbers or references to baselines. In the revised manuscript, we will update the abstract to incorporate key quantitative findings, such as the percentage reduction in the domain gap and the success-rate improvements with statistical details, to better support the empirical contributions. Revision: yes.
- Referee: [Method and Experiments] No implementation details, neural simulator architecture, training procedure, or fidelity metrics (e.g., action reconstruction error, optical-flow consistency, or policy-ablation deltas) are supplied to verify that the transformation preserves the underlying action trajectories and dynamics, which is load-bearing for the claim that the generated data improves rather than degrades downstream policies.
  Authors: We acknowledge that additional details are necessary to substantiate the claims. The current version provides an overview but lacks the requested specifics. We will revise the Method and Experiments sections to include the neural simulator's architecture, the detailed training procedure, and fidelity metrics including action reconstruction error, optical-flow consistency checks, and policy-ablation studies with performance deltas. This will demonstrate that the data generation preserves trajectories and dynamics and improves policy performance. Revision: yes.
Circularity Check
No circularity: high-level empirical pipeline with no derivations or self-referential fits
Full rationale
The provided abstract and description contain no equations, parameter fits, uniqueness theorems, or derivation chains. The method is described as a closed-loop data augmentation pipeline trained on real data to generate simulated-to-real videos, with success claimed via downstream experiments. No load-bearing step reduces to its own inputs by construction, self-citation, or renaming. This matches the default case of a self-contained empirical claim whose validity rests on external validation rather than internal redefinition.
Axiom & Free-Parameter Ledger
invented entities (1)
- Compositional Simulation: no independent evidence
Reference graph
Works this paper leans on
- [2] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)
- [3] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
- [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
- [5] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023)
- [7] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)
- [8] Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: ICML (2024)
- [9] Cheang, C.L., Chen, G., Jing, Y., Kong, T., Li, H., Li, Y., Liu, Y., Wu, H., Xu, J., Yang, Y., et al.: GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 (2024)
- [10] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Liang, Q., Li, Z., Lin, X., Ge, Y., Gu, Z., et al.: RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)
- [11] Chen, Z., Walsman, A., Memmel, M., Mo, K., Fang, A., Vemuri, K., Wu, A., Fox, D., Gupta, A.: URDFormer: A pipeline for constructing articulated simulation environments from real-world images. arXiv preprint arXiv:2405.11656 (2024)
- [12] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion Policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research p. 02783649241273668 (2023)
- [13] Dai, T., Wong, J., Jiang, Y., Wang, C., Gokmen, C., Zhang, R., Wu, J., Fei-Fei, L.: Automated creation of digital cousins for robust policy learning. arXiv preprint arXiv:2410.07408 (2024)
- [14] Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems 36, 9156–9172 (2023)
- [15] Gu, J., Xiang, F., Li, X., Ling, Z., Liu, X., Mu, T., Tang, Y., Tao, S., Wei, X., Yao, Y., et al.: ManiSkill2: A unified benchmark for generalizable manipulation skills. In: The Eleventh International Conference on Learning Representations (2023)
- [16] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)
- [17] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
- [18] Ke, T.W., Gkanatsios, N., Fragkiadaki, K.: 3D Diffuser Actor: Policy diffusion with 3D scene representations. arXiv preprint arXiv:2402.10885 (2024)
- [19] Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)
- [20] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: OpenVLA: An open-source vision-language-action model. In: 8th Annual Conference on Robot Learning (2024)
- [21]
- [22] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An interactive 3D environment for visual AI. arXiv (2017)
- [23] Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Lingelbach, M., Sun, J., Anvari, M., Hwang, M., Sharma, M., Aydin, A., Bansal, D., Hunter, S., Kim, K.Y., Lou, A., Matthews, C.R., Villa-Renteria, I., Tang, J.H., Tang, C., Xia, F., Savarese, S., Gweon, H., Liu, K., Wu, J., Fei-Fei, L.: BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In: 6th Annual Conference on Robot Learning (2022), https://openreview.net/forum?id=_8DoIe8G3t
- [24] Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al.: CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)
- [25] Liang, Z., Mu, Y., Ding, M., Ni, F., Tomizuka, M., Luo, P.: AdaptDiffuser: Diffusion models as adaptive self-evolving planners. In: International Conference on Machine Learning. pp. 20725–20745. PMLR (2023)
- [26] Liang, Z., Mu, Y., Ma, H., Tomizuka, M., Ding, M., Luo, P.: SkillDiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16467–16476 (2024)
- [27] Liang, Z., Mu, Y., Wang, Y., Chen, T., Shao, W., Zhan, W., Tomizuka, M., Luo, P., Ding, M.: DexHandDiff: Interaction-aware diffusion planning for adaptive dexterous manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1745–1755 (2025)
- [28] Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 (2024)
- [29] Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., Florence, P.: Interactive Language: Talking to robots in real time. IEEE Robotics and Automation Letters (2023)
- [30] Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., State, G.: Isaac Gym: High performance GPU-based physics simulation for robot learning. In: Vanschoren, J., Yeung, S. (eds.) Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021)
- [31] Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., Fox, D.: MimicGen: A data generation system for scalable robot learning using human demonstrations. In: 7th Annual Conference on Robot Learning (2023)
- [32] Mu, Y., Chen, T., Peng, S., Chen, Z., Gao, Z., Zou, Y., Lin, L., Xie, Z., Luo, P.: RoboTwin: Dual-arm robot benchmark with generative digital twins (early version). arXiv preprint arXiv:2409.02920 (2024)
- [33] Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In: Robotics: Science and Systems (RSS) (2024)
- [34] NVIDIA: Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575 (2025)
- [35] Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Chen, L.Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., Levine, S.: Octo: An open-source generalist robot policy. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands (2024)
- [36] OpenAI: Creating video from text. https://openai.com/index/sora/ (2024)
- [37] OpenAI: GPT-5 system card (updated August 13, 2025). https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf (Aug 2025)
- [38] Qin, Y., Kang, L., Song, X., Yin, Z., Liu, X., Liu, X., Zhang, R., Bai, L.: RoboFactory: Exploring embodied agent collaboration with compositional constraints. arXiv preprint arXiv:2503.16408 (2025)
- [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [40] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017), http://arxiv.org/abs/1707.06347
- [41] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D., Maksymets, O., Gokaslan, A., Vondrus, V., Dharur, S., Meier, F., Galuba, W., Chang, A., Kira, Z., Koltun, V., Malik, J., Savva, M., Batra, D.: Habitat 2.0: Training home assistants to rearrange their habitat. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
- [42] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
- [43] Todorov, E., Erez, T., Tassa, Y.: MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 5026–5033. IEEE (2012). https://doi.org/10.1109/IROS.2012.6386109
- [44] Torne, M., Simeonov, A., Li, Z., Chan, A., Chen, T., Gupta, A., Agrawal, P.: Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. arXiv preprint arXiv:2403.03949 (2024)
- [45] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
- [46] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
- [47] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
- [48] Wang, C., Fang, H., Fang, H.S., Lu, C.: RISE: 3D perception makes real-world robot imitation simple and effective. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2870–2877. IEEE (2024)
- [49] Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)
- [50] Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 (2023)
- [51] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: SAPIEN: A simulated part-based interactive environment. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
- [52] Xue, Z., Deng, S., Chen, Z., Wang, Y., Yuan, Z., Xu, H.: DemoGen: Synthetic demonstration generation for data-efficient visuomotor policy learning. arXiv preprint arXiv:2502.16932 (2025)
- [53] Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 (2023)
- [54] Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: FRESCO: Spatial-temporal correspondence for zero-shot video translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8703–8712 (2024)
- [55] Ye, S., Jang, J., Jeon, B., Joo, S.J., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.W., Lin, B.Y., et al.: Latent action pretraining from videos. In: CoRL 2024 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond (2024)
- [56] Yu, J., Qin, Y., Wang, X., Wan, P., Zhang, D., Liu, X.: GameFactory: Creating new games with generative interactive videos (2025)
- [57] Yu, T., Xiao, T., Stone, A., Tompson, J., Brohan, A., Wang, S., Singh, J., Tan, C., Peralta, J., Ichter, B., et al.: Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550 (2023)
- [58] Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. In: Proceedings of Robotics: Science and Systems (RSS) (2024)
- [59] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)
- [60] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-Sora: Democratizing efficient video production for all (March 2024), https://github.com/hpcaitech/Open-Sora
- [61] Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: RoboDreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024)
- [62] Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: IRASim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540 (2024)
- [63] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)