pith. sign in

arxiv: 2606.18646 · v1 · pith:6SLDUL5Xnew · submitted 2026-06-17 · 💻 cs.RO

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

Pith reviewed 2026-06-26 20:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords mobile manipulationsim-to-real transferembodied intelligenceautomated scene generationrobot middlewarehousehold roboticsskill learning
0
0 comments X

The pith

BestMan platform automates scene generation and provides unified middleware to enable seamless real-to-sim-to-real transfer for household mobile manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome three obstacles that block progress in mobile manipulation for homes: the high cost of building accurate simulation scenes by hand, the difficulty of testing many strategies systematically, and the challenge of moving those strategies to different real robots without major rework. It presents BestMan as a platform whose Automated Scene Generation module turns real observations into usable simulations automatically, whose simulation-guided architecture lets researchers combine and evaluate hybrid skills at large scale inside the simulator, and whose Hardware-agnostic and Unified Middleware makes the same code run on varied physical robots. A sympathetic reader would see this as a way to shorten the cycle from idea to working system, allowing more strategies to be tried cheaply before real-world tests and creating shared benchmarks that different labs can compare directly.

Core claim

BestMan is a scalable and seamless real-to-sim-to-real platform that bridges the gap between the simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. It consists of a novel Automated Scene Generation module to reconstruct realistic simulations from real observations, a simulation-guided task formalization and skill learning architecture that supports flexible integration and large-scale evaluations of hybrid skill strategies in simulation, and a Hardware-agnostic and Unified Middleware to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments.

What carries the argument

The BestMan platform, built around the Automated Scene Generation (ASG) module for observation-based simulation reconstruction, the simulation-guided task formalization and skill learning architecture for strategy evaluation, and the Hardware-agnostic and Unified Middleware (HUM) for cross-robot compatibility.

If this is right

  • Enables large-scale evaluations of hybrid skill strategies inside simulation before any real-world testing.
  • Supports standardized benchmarks that different research groups can use to compare mobile manipulation approaches.
  • Allows the same learned strategies to transfer to heterogeneous mobile manipulators with minimal hardware-specific changes.
  • Reduces reliance on expensive manual scene reconstruction for each new household environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider adoption could let smaller labs run far more manipulation experiments by shifting most testing into simulation.
  • The middleware layer might make it practical to share complete task solutions across labs that own different robot models.
  • If the automated scene generation generalizes beyond static rooms, the same pipeline could support tasks that involve moving objects or people.

Load-bearing premise

The Automated Scene Generation module can create simulations from real observations that are accurate enough for reliable strategy evaluation and successful transfer to physical robots without costly manual high-fidelity reconstruction.

What would settle it

A controlled comparison in which strategies developed and evaluated inside BestMan simulations show no measurable improvement in real-robot success rates or transfer efficiency compared with strategies developed using existing manual or non-automated simulation pipelines.

read the original abstract

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving a seamless transfer across the real-to-sim-to-real cycle faces three key challenges, including costly high-fidelity simulation scenes reconstruction, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between the simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluations of hybrid skill strategies in simulation. Finally, to enhance the real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces BestMan, a scalable platform for real-to-sim-to-real transfer of household mobile manipulation tasks. It identifies three challenges (costly high-fidelity scene reconstruction, complex strategy evaluation in simulation, and incompatible real-world deployments) and proposes three components to address them: an Automated Scene Generation (ASG) module that reconstructs realistic simulations from real observations, a simulation-guided task formalization and skill learning architecture for flexible integration and large-scale evaluation of hybrid skill strategies, and a Hardware-agnostic and Unified Middleware (HUM) for seamless transfer across heterogeneous mobile manipulators. The abstract states that experimental results demonstrate superior performance in establishing standardized benchmarks.

Significance. If the experimental claims hold and the ASG module produces simulations sufficiently accurate for strategy evaluation and transfer, the platform could provide valuable shared infrastructure for embodied AI research, lowering the cost of scene setup and enabling reproducible large-scale testing of mobile manipulation strategies before real deployment.

major comments (2)
  1. [Abstract] Abstract: the claim that 'Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks' is unsupported by any quantitative results, error bars, baseline comparisons, task definitions, robot platforms, or dataset details. Without these, the central claim of seamless real-to-sim-to-real transfer cannot be evaluated.
  2. [ASG module description] ASG module description (Abstract): the assertion that ASG 'reconstructs realistic simulations from real observations' supplies no reconstruction method (sensor fusion, object pose estimation, material parameter fitting, etc.), no quantitative fidelity measures (geometric error, dynamics match, visual domain gap), and no transfer success rates versus manual reconstruction. This leaves the load-bearing assumption that ASG-generated scenes are accurate enough for strategy evaluation and sim-to-real transfer untestable.
minor comments (1)
  1. The term 'hybrid skill strategies' is used without definition or examples of what constitutes a hybrid strategy or how the architecture supports their integration and evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract requires revision to better substantiate its claims with references to the quantitative content in the full manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks' is unsupported by any quantitative results, error bars, baseline comparisons, task definitions, robot platforms, or dataset details. Without these, the central claim of seamless real-to-sim-to-real transfer cannot be evaluated.

    Authors: The abstract serves as a concise summary. The full manuscript contains an Experiments section (Section 5) that reports quantitative results with error bars, baseline comparisons, explicit task definitions, multiple robot platforms, and dataset details supporting the real-to-sim-to-real transfer claims. We will revise the abstract to include a brief summary of the key quantitative findings (e.g., success rates and benchmark comparisons) to make the central claim directly traceable to the reported evidence. revision: yes

  2. Referee: [ASG module description] ASG module description (Abstract): the assertion that ASG 'reconstructs realistic simulations from real observations' supplies no reconstruction method (sensor fusion, object pose estimation, material parameter fitting, etc.), no quantitative fidelity measures (geometric error, dynamics match, visual domain gap), and no transfer success rates versus manual reconstruction. This leaves the load-bearing assumption that ASG-generated scenes are accurate enough for strategy evaluation and sim-to-real transfer untestable.

    Authors: The ASG module is described in detail in Section 3.1 of the full manuscript, which specifies the reconstruction pipeline (RGB-D sensor fusion, object pose estimation, and material parameter fitting) along with quantitative fidelity metrics (geometric error, dynamics match, visual domain gap) and transfer success rates compared against manual reconstruction. These evaluations demonstrate that ASG scenes are sufficiently accurate for strategy evaluation. We will revise the abstract to include a short clause referencing the reconstruction approach and fidelity results reported in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: system description with experimental claims, no derivations or fitted reductions

full rationale

The paper describes an engineering platform (BestMan) consisting of ASG for scene reconstruction, a simulation-guided architecture for skill learning, and HUM middleware for deployment. No equations, parameters fitted to data subsets, or derivation chains are present in the provided text. Central claims rest on experimental demonstration rather than any step that reduces by construction to its own inputs or to self-citations. This is the expected non-circular outcome for a system paper whose weakest assumption concerns empirical fidelity rather than mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The platform introduces three named modules as contributions; these are treated as invented entities because they are presented as novel solutions without independent external validation cited in the abstract. No free parameters or explicit axioms are stated.

invented entities (3)
  • Automated Scene Generation (ASG) module no independent evidence
    purpose: Reconstruct realistic simulation scenes from real observations to avoid costly manual reconstruction
    Presented as a novel component addressing the first challenge; no external evidence of accuracy provided in abstract.
  • simulation-guided task formalization and skill learning architecture no independent evidence
    purpose: Support flexible integration and large-scale evaluation of hybrid skill strategies
    Introduced to address complexity of strategy evaluation; no prior citation or independent validation referenced.
  • Hardware-agnostic and Unified Middleware (HUM) no independent evidence
    purpose: Ensure seamless sim-to-real transfer across heterogeneous mobile manipulators
    Presented as the solution to incompatible deployments; no external evidence of compatibility shown.

pith-pipeline@v0.9.1-grok · 5759 in / 1377 out tokens · 21688 ms · 2026-06-26T20:59:07.670389+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 7 canonical work pages

  1. [1]

    IEEE Transactions on Emerging Topics in Computational Intelligence6(2), 230–244 (2022) https://doi.org/10.1109/TETCI.2022.3141105

    Duan, J., Yu, S., Tan, H.L., Zhu, H., Tan, C.: A survey of embodied ai: From simu- lators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence6(2), 230–244 (2022) https://doi.org/10.1109/TETCI.2022.3141105

  2. [2]

    IEEE/ASME 27 Transactions on Mechatronics30(6), 7253–7274 (2025) https://doi.org/10.1109/ TMECH.2025.3574943

    Liu, Y., Chen, W., Bai, Y., Liang, X., Li, G., Gao, W., Lin, L.: Aligning cyber space with physical world: A comprehensive survey on embodied ai. IEEE/ASME 27 Transactions on Mechatronics30(6), 7253–7274 (2025) https://doi.org/10.1109/ TMECH.2025.3574943

  3. [3]

    CCF Transactions on Pervasive Computing and Interaction, 1–22 (2025)

    Tian, Y., Shi, M., Zhang, X., Zhang, B., Wang, M., Shi, Y.: Assisting embodied ai: a survey of 3d segmentation models for medical ct images. CCF Transactions on Pervasive Computing and Interaction, 1–22 (2025)

  4. [4]

    Frontiers of Computer Science 19(4), 194203 (2025)

    Wang, R., Mou, X., Wo, T., Zhang, M., Liu, Y., Wang, T., Liu, P., Yan, J., Liu, X.: Acbot: an iiot platform for industrial robots. Frontiers of Computer Science 19(4), 194203 (2025)

  5. [5]

    Journal of Mechanisms and Robotics15(2), 020801 (2022) https://doi.org/10

    Thakar, S., Srinivasan, S., Al-Hussaini, S., Bhatt, P.M., Rajendran, P., Jung Yoon, Y., Dhanaraj, N., Malhan, R.K., Schmid, M., Krovi, V.N., Gupta, S.K.: A survey of wheeled mobile manipulation: A decision-making perspective. Journal of Mechanisms and Robotics15(2), 020801 (2022) https://doi.org/10. 1115/1.4054611

  6. [6]

    IEEE Robotics and Automation Letters9(10), 8298–8305 (2024) https://doi.org/10.1109/LRA.2024.3441495

    Honerkamp, D., B¨ uchner, M., Despinoy, F., Welschehold, T., Valada, A.: Language-grounded dynamic scene graphs for interactive object search with mobile manipulation. IEEE Robotics and Automation Letters9(10), 8298–8305 (2024) https://doi.org/10.1109/LRA.2024.3441495

  7. [7]

    In: 13th International Conference on Learning Representations, ICLR 2025, pp

    Liu, Y., Liang, J.C., Tang, R., Lee, Y., Rabbani, M., Dianat, S., Rao, R., Huang, L., Liu, D., Wang, Q.,et al.: Re-imagining multimodal instruction tuning: A rep- resentation view. In: 13th International Conference on Learning Representations, ICLR 2025, pp. 102827–102850 (2025). International Conference on Learning Representations, ICLR

  8. [8]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Su, H., Xie, M., Cao, N., Ding, Y., Shao, B., Long, X., Gu, F., Chen, C.: Ova- fields: Weakly supervised open-vocabulary affordance fields for robot operational 28 part detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6385–6395 (2025)

  9. [9]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

    Wang, J., Cao, N., Ding, Y., Xie, M., Gu, F., Chen, C.: Ske-layout: Spatial knowl- edge enhanced layout generation with llms. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19414–19423 (2025)

  10. [10]

    https://arxiv.org/abs/2403.19940

    Shao, B., Cao, N., Ding, Y., Wang, X., Gu, F., Chen, C.: MoMa-Pos: An Efficient Object-Kinematic-Aware Base Placement Optimization Framework for Mobile Manipulation (2024). https://arxiv.org/abs/2403.19940

  11. [11]

    CCF Transactions on Pervasive Computing and Interaction, 1–16 (2025)

    Zhang, C., Chen, J., Geng, Y., Ge, J., Wang, D., Li, N., Zhang, Q., Zhang, T., Ji, M., Fu, T.: A global collaborative scheduling method for embedded artificial intelligence task offloading in a multi-cloud environment. CCF Transactions on Pervasive Computing and Interaction, 1–16 (2025)

  12. [12]

    In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat

    Koenig, N., Howard, A.: Design and use paradigms for gazebo, an open-source multi-robot simulator. In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), vol. 3, pp. 2149–21543 (2004). https://doi.org/10.1109/IROS.2004.1389727

  13. [13]

    doi:10.1109/IROS.2012.6386109

    Todorov, E., Erez, T., Tassa, Y.: Mujoco: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033 (2012). https://doi.org/10.1109/IROS.2012.6386109

  14. [14]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: Sapien: A simulated part- based interactive environment. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11094–11104 (2020). https://doi. org/10.1109/CVPR42600.2020.01111 29

  15. [15]

    Virtualhome: Simulating household activities via programs

    Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., Torralba, A.: Vir- tualhome: Simulating household activities via programs. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8494–8502 (2018). https://doi.org/10.1109/CVPR.2018.00886

  16. [16]

    In: The Twelfth International Conference on Learning Representations (2024).https://openreview.net/forum?id=4znwzG92CE

    Puig, X., Undersander, E., Szot, A., Cote, M.D., Yang, T.-Y., Partsey, R., Desai, R., Clegg, A., Hlavac, M., Min, S.Y., Vondruˇ s, V., Gervet, T., Berges, V.-P., Turner, J.M., Maksymets, O., Kira, Z., Kalakrishnan, M., Malik, J., Chaplot, D.S., Jain, U., Batra, D., Rai, A., Mottaghi, R.: Habitat 3.0: A co-habitat for humans, avatars, and robots. In: The T...

  17. [17]

    In: RSS 2024 Workshop: Data Generation for Robotics (2024)

    Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Man- dlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. In: RSS 2024 Workshop: Data Generation for Robotics (2024). https://openreview.net/forum?id=mHxHdTaRLa

  18. [18]

    In: Conference on Robot Learning, pp

    Li, C., Xia, F., Mart´ ın-Mart´ ın, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K.E., Gokmen, C., Dharan, G., Jain, T.,et al.: igibson 2.0: Object-centric sim- ulation for robot learning of everyday household tasks. In: Conference on Robot Learning, pp. 455–465 (2022). PMLR

  19. [19]

    In: Conference on Robot Learning, pp

    Yenamandra, S., Ramachandran, A., Yadav, K., Wang, A.S., Khanna, M., Gervet, T., Yang, T.-Y., Jain, V., Clegg, A., Turner, J.M.,et al.: Homerobot: Open- vocabulary mobile manipulation. In: Conference on Robot Learning, pp. 1975– 2011 (2023). PMLR

  20. [20]

    arXiv preprint arXiv:2401.12202 (2024) 30

    Liu, P., Orru, Y., Paxton, C., Shafiullah, N.M.M., Pinto, L.: OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202 (2024) 30

  21. [21]

    In: ICRA Workshop on Open Source Software, vol

    Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R., Ng, A.Y.,et al.: Ros: an open-source robot operating system. In: ICRA Workshop on Open Source Software, vol. 3, p. 5 (2009). Kobe

  22. [22]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA), pp

    Zhi, P., Zhang, Z., Zhao, Y., Han, M., Zhang, Z., Li, Z., Jiao, Z., Jia, B., Huang, S.: Closed-loop open-vocabulary mobile manipulation with gpt-4v. In: 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 4761–4767 (2025). IEEE

  23. [23]

    IEEE Robotics and Automation Letters8(6), 3740– 3747 (2023)

    Mittal, M., Yu, C., Yu, Q., Liu, J., Rudin, N., Hoeller, D., Yuan, J.L., Singh, R., Guo, Y., Mazhar, H.,et al.: Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters8(6), 3740– 3747 (2023)

  24. [24]

    https://arxiv.org/abs/2009.12293

    Zhu, Y., Wong, J., Mandlekar, A., Mart´ ın-Mart´ ın, R., Joshi, A., Lin, K., Mad- dukuri, A., Nasiriany, S., Zhu, Y.: robosuite: A Modular Simulation Framework and Benchmark for Robot Learning (2025). https://arxiv.org/abs/2009.12293

  25. [25]

    In: 2022 International Conference on Robotics and Automation (ICRA), pp

    Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google scanned objects: A high-quality dataset of 3d scanned household items. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 2553–2560 (2022). IEEE

  26. [26]

    https: //arxiv.org/abs/2410.02193

    Yang, Z., Garrett, C., Fox, D., Lozano-P´ erez, T., Kaelbling, L.P.: Guiding Long- Horizon Task and Motion Planning with Vision Language Models (2024). https: //arxiv.org/abs/2410.02193

  27. [27]

    In: 2024 IEEE International Conference on 31 Robotics and Automation (ICRA), pp

    Sermanet, P., Ding, T., Zhao, J., Xia, F., Dwibedi, D., Gopalakrishnan, K., Chan, C., Dulac-Arnold, G., Maddineni, S., Joshi, N.J.,et al.: Robovqa: Multimodal long-horizon reasoning for robotics. In: 2024 IEEE International Conference on 31 Robotics and Automation (ICRA), pp. 645–652 (2024). IEEE

  28. [28]

    Barron, Ben Mildenhall, Dor Verbin, Pratul P

    Han, C., Wang, Q., Cui, Y., Cao, Z., Wang, W., Qi, S., Liu, D.: E2vpt: An effective and efficient approach for visual prompt tuning. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 17445–17456 (2023). https://doi.org/10.1109/ICCV51070.2023.01604

  29. [29]

    In: European Conference on Computer Vision, pp

    Han, C., Wang, Q., Dianat, S.A., Rabbani, M., Rao, R.M., Fang, Y., Guan, Q., Huang, L., Liu, D.: Amd: Automatic multi-step distillation of large-scale vision models. In: European Conference on Computer Vision, pp. 431–450 (2024). Springer

  30. [30]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA), pp

    Neary, C., Ellis, C., Samyal, A.S., Lennon, C., Topcu, U.: A multifidelity sim- to-real pipeline for verifiable and compositional reinforcement learning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 4349– 4355 (2024). IEEE

  31. [31]

    Frontiers of Computer Science19(9), 1–3 (2025)

    Yang, K., Cao, N., Shao, B., Wang, X., Ding, Y., Chen, C.: Bestman: a modular mobile manipulator platform for embodied ai with unified simulation-hardware apis. Frontiers of Computer Science19(9), 1–3 (2025)

  32. [32]

    Coumans, E., Bai, Y.: Pybullet, a python module for physics simulation for games, robotics and machine learning (2016)

  33. [33]

    https://www.blender.org

    Blender - a 3D modelling and rendering package. https://www.blender.org. Accessed: 2025-02-20 (2023)

  34. [34]

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., Zeng, Z., Zhang, H., Li, F., Yang, J., Li, H., Jiang, Q., Zhang, L.: Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks (2024) 32

  35. [35]

    Advances in Neural Information Processing Systems37, 21875– 21911 (2024)

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875– 21911 (2024)

  36. [36]

    In: International Conference on Machine Learning, pp

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.,et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PmLR

  37. [37]

    Transactions on Machine Learning Research Journal, 1–31 (2024)

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 1–31 (2024)

  38. [38]

    In: Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023 (2023)

    Chen, Q., Memmel, M., Fang, A., Walsman, A., Fox, D., Gupta, A.: URDFormer: Constructing interactive realistic scenes from real images via simulation and generative modeling. In: Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023 (2023). https://openreview.net/forum?id=bcjpfb6Bh9

  39. [39]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

    Lin, J., Zhang, L., Lee, K., Ning, J., Goldfeder, J., Lipson, H.: Autourdf: Unsu- pervised robot modeling from point cloud frames using cluster registration. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27628–27637 (2025)

  40. [40]

    The International Journal of Robotics Research36(3), 261–268 (2017)

    Calli, B., Singh, A., Bruce, J., Walsman, A., Konolige, K., Srinivasa, S., Abbeel, P., Dollar, A.M.: Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research36(3), 261–268 (2017)

  41. [41]

    In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp

    Lindermayr, J., Odabasi, C., Jordan, F., Graf, F., Knak, L., Kraus, W., Bormann, 33 R., Huber, M.F.: IPA-3D1K: a large retail 3d model dataset for robot picking. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11404–11411 (2023). IEEE

  42. [42]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object under- standing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2019)

  43. [43]

    https://github.com/luca-medeiros/ lang-segment-anything

    Lang Segment Anything. https://github.com/luca-medeiros/ lang-segment-anything. Accessed: 2025-02-20 (2022)

  44. [44]

    IEEE Transactions on Robotics39(5), 3929–3945 (2023)

    Fang, H.-S., Wang, C., Fang, H., Gou, M., Liu, J., Yan, H., Liu, W., Xie, Y., Lu, C.: Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics39(5), 3929–3945 (2023)

  45. [45]

    In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp

    Sundermeyer, M., Mousavian, A., Triebel, R., Fox, D.: Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13438–13444 (2021). IEEE

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Li, G., Jampani, V., Sun, D., Sevilla-Lara, L.: Locate: Localize and transfer object parts for weakly supervised affordance grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10922–10931 (2023)

  47. [47]

    IEEE Robotics & Automation Magazine19(4), 72–82 (2012)

    Sucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robotics & Automation Magazine19(4), 72–82 (2012)

  48. [48]

    In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp

    Rohmer, E., Singh, S.P., Freese, M.: V-REP: A versatile and scalable robot sim- ulation framework. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326 (2013). IEEE 34

  49. [49]

    arXiv preprint arXiv:1712.05474 (2017)

    Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y., et al.: Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017)

  50. [50]

    IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

    James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

  51. [51]

    arXiv preprint arXiv:2410.00425 (2024)

    Tao, S., Xiang, F., Shukla, A., Qin, Y., Hinrichsen, X., Yuan, X., Bao, C., Lin, X., Liu, Y., Chan, T.-k., et al.: Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425 (2024)

  52. [52]

    In: Conference on Robot Learning, pp

    Dai, T., Wong, J., Jiang, Y., Wang, C., Gokmen, C., Zhang, R., Wu, J., Fei-Fei, L.: Automated creation of digital cousins for robust policy learning. In: Conference on Robot Learning, pp. 4912–4943 (2025). PMLR

  53. [53]

    arXiv preprint arXiv:2309.13707 (2023)

    Gao, K., Ding, Y., Zhang, S., Yu, J.: ORLA*: Mobile manipulator-based object rearrangement with lazy a. arXiv preprint arXiv:2309.13707 (2023)

  54. [54]

    arXiv preprint arXiv:2409.16030 (2024) 35

    Yu, W., Peng, J., Ying, Y., Li, S., Ji, J., Zhang, Y.: MHRC: Closed-loop decentral- ized multi-heterogeneous robot collaboration with large language models. arXiv preprint arXiv:2409.16030 (2024) 35