pith. machine review for the scientific record.

arxiv: 2604.15805 · v1 · submitted 2026-04-17 · 💻 cs.RO · cs.AI

Recognition: unknown

From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 08:42 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords generative simulation · digital cousins · sim-to-real transfer · robot learning · data augmentation · panorama mapping · embodied AI · scene editing

The pith

A generative framework maps real-world panoramas to editable high-fidelity simulation scenes that scale data for better robot policy generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative method that converts real panoramic images into detailed simulation environments. Semantic and geometric edits then create many varied versions called Digital Cousins for efficient data augmentation. This sidesteps the expense of gathering diverse physical assets and reconfiguring real spaces for robot training. Scaling the volume of these generated scenes produces policies that transfer more reliably to previously unseen scenes and objects. The work also stitches multiple rooms into consistent large environments to handle longer navigation sequences.
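To make that flow concrete, here is a minimal sketch of the panorama-to-cousins pipeline. Every name in it (`SimScene`, `panorama_to_sim_scene`, `make_cousins`) is a hypothetical placeholder standing in for the paper's stages, not its actual API, which builds on the Marble engine and NVIDIA Isaac Sim.

```python
# Hypothetical sketch of the panorama -> twin -> cousins flow summarized above.
# These interfaces are placeholders that mirror the paper's stages, not its code.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SimScene:
    source_panorama: str             # path to the real-world capture
    assets: tuple[str, ...]          # object assets placed in the scene
    lighting: str = "reconstructed"

def panorama_to_sim_scene(panorama_path: str) -> SimScene:
    """Stage 1: generative real-to-sim mapping (the 'Sim Twin')."""
    # Placeholder: the paper performs this step with a generative model.
    return SimScene(source_panorama=panorama_path,
                    assets=("table", "mug", "microwave"))

def make_cousins(twin: SimScene, edits: list[dict]) -> list[SimScene]:
    """Stage 2: semantic and geometric edits yield diverse 'Digital Cousins'."""
    return [replace(twin, **edit) for edit in edits]

twin = panorama_to_sim_scene("kitchen_pano.jpg")
cousins = make_cousins(twin, [
    {"assets": ("table", "bowl", "microwave")},  # semantic edit: object swap
    {"lighting": "warm_evening"},                # appearance edit: lighting
])
# Stage 3 (not shown): roll out policies in the twin and cousins to scale data.
```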

Core claim

The central claim is that a generative real-to-sim mapping from real-world panoramas, followed by semantic and geometric editing to synthesize Digital Cousins, yields high-fidelity interactive scenes. When training data is scaled using these scenes, robot policies exhibit strong sim-to-real performance correlation and markedly improved generalization to scene and object variations not encountered during training.

What carries the argument

The generative real-to-sim panorama mapping, combined with semantic and geometric editing, produces diverse Digital Cousins scenes that support physics-based interaction.

If this is right

  • Extensively scaling generated cousin scenes produces significantly better generalization to unseen scene and object variations.
  • Strong correlation between simulation and real-world performance validates the platform's fidelity for policy training (one way to measure this is sketched after this list).
  • Multi-room stitching creates consistent large-scale environments usable for long-horizon navigation tasks.
  • The scenes support interactive manipulation because they incorporate high-quality physics engines and realistic assets.
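The correlation bullet is directly measurable. A minimal sketch of the computation, using invented success rates rather than the paper's numbers:

```python
# Sketch: sim-vs-real success-rate correlation across baseline methods.
# The rates below are illustrative, not results reported in the paper.
from statistics import correlation  # Pearson r; Python 3.10+

# One (sim, real) success-rate pair per baseline x generalization level.
sim_success  = [0.82, 0.55, 0.40, 0.71, 0.33]
real_success = [0.78, 0.50, 0.35, 0.69, 0.30]

r = correlation(sim_success, real_success)
print(f"Pearson r between sim and real success rates: {r:.3f}")
```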

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Digital Cousins could lower the barrier to training policies across many different physical settings by replacing repeated real-world collection with targeted edits.
  • The same panorama-to-scene pipeline might be applied to generate training data for tasks that involve dynamic changes, such as moving obstacles or changing lighting.
  • Integrating cousin scenes with existing real datasets could create hybrid training regimes that further close remaining performance gaps.

Load-bearing premise

The generated and edited simulation scenes must accurately reproduce real-world physics, lighting, and object behaviors so that policies trained in them transfer without large domain gaps.

What would settle it

The claim would be undermined if policies trained on large sets of Digital Cousins scenes failed to match the real-world success rates of policies trained on comparable real data when tested on the same manipulation or navigation tasks.
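One way to run that settling experiment: evaluate a cousin-trained policy and a real-trained policy over the same tasks and trial counts, then test whether their success rates differ. A minimal sketch using a standard two-proportion z-test; the counts are hypothetical, not the paper's results:

```python
# Sketch: two-proportion z-test comparing cousin-trained vs. real-trained policies.
# All counts are hypothetical; the paper does not report this exact comparison.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """z statistic and two-sided p-value for H0: equal success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# e.g. cousin-trained: 72/100 successes; real-trained: 80/100 successes
z, p = two_proportion_z(72, 100, 80, 100)
print(f"z = {z:.2f}, p = {p:.3f}")  # small |z|, large p -> rates comparable
```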

Figures

Figures reproduced from arXiv: 2604.15805 by Chen Xie, Feng Jiang, Jade Yang, Jasper Lu, Jingkai Xu, Ruihai Wu, Shawn Xie, Shengqiang Xu, Shugao Liu, Yuanfei Wang, Zhenhao Shen.

Figure 1. WorldComposer generates high-fidelity simulation scenes from real-world panorama, and further generates diverse cousin scenes through editing. The generated rooms can be seamlessly stitched together into multi-room environments for navigation. Combined with interactive tasks, we provide a platform for generalizable learning and evaluation.
Figure 2. Overview of WorldComposer Environment Generation. Our framework enables the rapid creation of interactive, high-fidelity simulation environments from real-world data. (Top) Generative Real-to-Sim Scene Construction: Using panoramic captures and our Marble engine, we reconstruct a “Sim Twin” and generate diverse “Sim Cousins” through prompt-based world editing. (Middle) Multi-Room Stitching: To handle compl…
Figure 3. Simulation Tasks and Digital Cousin Generation. Our pipeline supports a diverse range of high-fidelity simulation tasks. Each row demonstrates the transition from a sparse real-world capture (Real) to a precise digital reconstruction (Twin), followed by the generation of multiple Digital Cousins. Starting from a single panoramic image, we exploit the prompt-driven editing capabilities of Marble to synt…
Figure 4. Real-World Setup. The hardware configuration consists of two lerobot arms and diverse object instances.
Figure 5. Real-to-Sim-to-Real Pipeline and Qualitative Analysis. (Top) The co-training framework with real-to-sim cousins. (Bottom) The comparison of policy execution in unseen scenarios. Aug denotes the augmentation of generated data.
Figure 6. Performance with Increasing Cousin Sim Data. Real-world success rate on the Set Tableware task with increasing cousin sim data under the unseen scene and object variations.
Figure 7. Sim-Real Evaluation Correlation. The scatter plot illustrates the average success rates across three tasks (Set Tableware, Open Microwave, Fold Cloth) for various baselines (ACT, DP, SmolVLA, π0) under four generalization levels.
Figure 9. Real world vs. Simulation navigation trajectories and …
Figure 10. Additional Simulation Tasks. Visualization of the Digital Twin (left) and generated Digital Cousin (right) environments for the tasks not displayed in the main manuscript.
read the original abstract

Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, augmenting real-world scenes into simulation has become a practical augmentation for efficient learning and evaluation. We present a generative framework that establishes a generative real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesize diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a generative real-to-sim framework that converts real-world panoramas into high-fidelity simulation scenes and further creates diverse 'Digital Cousins' via semantic and geometric edits. It incorporates multi-room stitching for large-scale consistent environments and claims that the resulting scenes support interactive robot manipulation and navigation tasks. Experiments are said to show strong sim-to-real correlation validating platform fidelity, and that scaling data generation yields significantly better generalization to unseen scene and object variations.

Significance. If the central claims hold with proper controls, the work could meaningfully advance scalable sim-to-real robot learning by offering a practical way to augment limited real data with editable, high-fidelity synthetic scenes. The emphasis on controlled diversity through editing and large-scale environment construction addresses important bottlenecks in data collection cost and policy generalization.

major comments (2)
  1. [Scaling experiments] Scaling experiments section: no ablation is reported that holds the total training scene count fixed while varying only the semantic/geometric edit mechanism versus other sources of diversity. This is load-bearing for the claim that Digital Cousins specifically drive generalization gains to unseen variations, as opposed to benefits from raw increases in data volume or variety. (A configuration sketch of such an ablation follows this report.)
  2. [Abstract and results] Abstract and results summary: the claims of 'strong sim-to-real correlation' and 'significantly better generalization' are stated without accompanying quantitative metrics, baseline comparisons, error bars, statistical tests, or analysis of potential artifacts from the generative editing process (e.g., inconsistencies in contact dynamics or lighting). These details are required to substantiate the central experimental assertions.
minor comments (1)
  1. [Introduction] The term 'Digital Cousins' is used throughout without an explicit formal definition or direct comparison table to related concepts such as digital twins or domain randomization in the introduction or related work.
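Major comment 1 can be pinned down as a small experiment grid in which the total scene budget is held fixed and only the source of diversity varies (the sketch referenced above). All arm names, edit labels, and the 200-scene budget are illustrative, not taken from the paper:

```python
# Sketch of the requested controlled ablation: the total scene count is fixed,
# so any generalization gap between arms isolates the edit mechanism itself.
# Arm names, edit labels, and the 200-scene budget are illustrative only.
ABLATION_ARMS = [
    {"name": "twin_replicas",   "n_scenes": 200, "edits": []},
    {"name": "semantic_edits",  "n_scenes": 200, "edits": ["object_swap"]},
    {"name": "geometric_edits", "n_scenes": 200, "edits": ["layout_shift"]},
    {"name": "full_cousins",    "n_scenes": 200,
     "edits": ["object_swap", "layout_shift", "lighting_change"]},
]

for arm in ABLATION_ARMS:
    # train_policy(...) and eval_on_unseen(...) would plug in here; the matched
    # n_scenes budget across arms is the point of the design.
    print(f"{arm['name']}: {arm['n_scenes']} scenes, edits={arm['edits']}")
```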

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for strengthening the experimental validation of our claims. We address each major comment below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Scaling experiments] Scaling experiments section: no ablation is reported that holds total training scene count fixed while varying only the semantic/geometric edit mechanism versus other sources of diversity. This is load-bearing for the claim that Digital Cousins specifically drive generalization gains to unseen variations, as opposed to benefits from raw increases in data volume or variety.

    Authors: We agree that this ablation would more directly isolate the contribution of the semantic and geometric editing process. Our scaling experiments demonstrate that increasing the volume of data generated via Digital Cousins improves generalization, but they do not hold total scene count constant to separate the effect of edits from other sources of variety. In the revised manuscript, we will add this controlled ablation, comparing policies trained on edited versus unedited scenes with matched total counts, to better support the specific role of Digital Cousins. revision: yes

  2. Referee: [Abstract and results] Abstract and results summary: the claims of 'strong sim-to-real correlation' and 'significantly better generalization' are stated without accompanying quantitative metrics, baseline comparisons, error bars, statistical tests, or analysis of potential artifacts from the generative editing process (e.g., inconsistencies in contact dynamics or lighting). These details are required to substantiate the central experimental assertions.

    Authors: The full results section reports quantitative metrics including sim-to-real correlation values, success rate improvements over baselines, and generalization gaps to unseen scenes. However, we acknowledge that the abstract and high-level summary would benefit from explicit inclusion of these numbers, error bars, and statistical tests. We will revise the abstract to incorporate key quantitative results and add a discussion of potential generative artifacts (such as lighting and contact consistency) with supporting analysis in the experiments section. revision: partial
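The error bars promised in response 2 have a standard form for success rates measured over a fixed number of evaluation trials: the Wilson score interval. A minimal sketch with illustrative counts; the helper is textbook statistics, not the authors' code:

```python
# Sketch: Wilson 95% confidence interval for a success rate over n trials.
# Counts are illustrative; this is standard statistics, not the paper's code.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(85, 100)  # e.g. 85 successes in 100 rollouts
print(f"success rate 0.85, 95% CI [{lo:.3f}, {hi:.3f}]")
```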

Circularity Check

0 steps flagged

No circularity: empirical validation of generative pipeline stands independently

full rationale

The paper describes a generative real-to-sim pipeline that maps panoramas to simulation scenes via semantic/geometric edits, then evaluates the resulting platform through separate experiments measuring sim-to-real correlation and generalization gains from scaled data. No equations, derivations, or claims reduce by construction to fitted parameters, self-definitions, or self-citations; the central results are presented as outcomes of external validation rather than tautological renamings or imported uniqueness theorems. The methodology is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the generative model faithfully reproducing real-world properties and on the assumption that increased data diversity directly improves generalization without simulation-specific biases.

axioms (1)
  • domain assumption: High-quality physics engines combined with realistic assets sufficiently model real-world interactions for manipulation and navigation tasks
    Invoked to support interactive use of generated scenes.
invented entities (1)
  • Digital Cousins · no independent evidence
    purpose: Diverse simulation scenes synthesized from real panoramas through semantic and geometric editing
    Core new concept for scalable data augmentation in robot learning.

pith-pipeline@v0.9.0 · 5490 in / 1261 out tokens · 42812 ms · 2026-05-10T08:42:06.848889+00:00 · methodology

