
arxiv: 2606.27962 · v1 · pith:ADJDY57J · submitted 2026-06-26 · 💻 cs.RO

Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence
Cloud-Native Simulation Infrastructure for Embodied Intelligence Training, Evaluation, and Data Collection

Pith reviewed 2026-06-29 04:23 UTC · model grok-4.3

classification 💻 cs.RO
keywords cloud-native · simulation infrastructure · embodied intelligence · robotics · data collection · model evaluation · scalable platform · closed-loop

The pith

A cloud-native simulation infrastructure unifies data generation, training, evaluation, and deployment for embodied intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that uses cloud-native technologies to create a scalable and reproducible simulation environment for embodied intelligence. It targets the high cost and poor reproducibility of real-world robotic data collection through elastic scheduling, containerization, and unified data management. The system features a four-layer architecture that automates environment generation, task execution, trajectory collection, and closed-loop optimization. The design is aimed at large-scale multi-model and multi-task workloads and integrates representative systems (D-VLA, RL-VLA3, Sword, and Pre-VLA) for scalable simulation, dynamic scheduling, visual augmentation, and real-time data filtering.

Core claim

The authors claim that cloud-native simulation infrastructure, through its four-layer architecture and adoption of elastic resource scheduling, containerized simulation, unified data management, and service-oriented design, provides a unified foundation for simulation environment generation, task execution, trajectory collection, model evaluation, data management, and cloud services in embodied intelligence.
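Neither the summary nor the abstract specifies how elastic resource scheduling decides how many simulation containers to run. As a minimal sketch under stated assumptions, the function below applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with queued episodes per worker as the metric and a tolerance band (the HPA default is 0.1) that suppresses small adjustments. The function name, the backlog metric, and the bounds are hypothetical illustrations, not the paper's scheduler.

```python
import math


def desired_workers(current_workers: int,
                    queued_episodes: int,
                    target_backlog_per_worker: float,
                    tolerance: float = 0.1,
                    min_workers: int = 1,
                    max_workers: int = 1000) -> int:
    """Hypothetical HPA-style sizing rule for a pool of simulation containers.

    The metric is queued episodes per running worker; the pool is scaled
    proportionally so that this backlog approaches the target.
    """
    if target_backlog_per_worker <= 0:
        raise ValueError("target_backlog_per_worker must be positive")
    if current_workers < 1:
        # Cold start: size the pool directly from the backlog.
        desired = math.ceil(queued_episodes / target_backlog_per_worker)
    else:
        ratio = (queued_episodes / current_workers) / target_backlog_per_worker
        if abs(ratio - 1.0) <= tolerance:
            # Inside the tolerance band: keep the current size to avoid thrashing.
            desired = current_workers
        else:
            desired = math.ceil(current_workers * ratio)
    # Clamp to the configured bounds.
    return max(min_workers, min(max_workers, desired))


print(desired_workers(current_workers=10, queued_episodes=400, target_backlog_per_worker=20))  # prints 20
```

The tolerance band is the design choice that matters most in practice: without it, small fluctuations in the queue would repeatedly start and stop containers.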

What carries the argument

The four-layer cloud-native simulation infrastructure architecture that unifies environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization.
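To make the stage ordering concrete, here is a toy, single-process sketch of one closed-loop round: expand environment assets into tasks, collect trajectories, evaluate a success rate, and keep the successful rollouts as data. Every class and method name is hypothetical, the policy is stubbed by a fixed success probability, and the training step that would consume the retained data is omitted; this is not the paper's architecture or API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task_id: str
    num_steps: int
    success: bool


@dataclass
class ClosedLoopPipeline:
    """Hypothetical stand-in for the stages named above; not the paper's API."""
    assets: list[str]
    seed: int = 0
    dataset: list[Trajectory] = field(default_factory=list)

    def generate_tasks(self, per_asset: int) -> list[str]:
        # Automated task generation: expand each environment asset into task variants.
        return [f"{asset}/task-{i}" for asset in self.assets for i in range(per_asset)]

    def collect(self, tasks: list[str], success_prob: float) -> list[Trajectory]:
        # Trajectory collection; the policy is stubbed by a fixed success probability.
        rng = random.Random(self.seed)
        return [Trajectory(t, rng.randint(10, 100), rng.random() < success_prob) for t in tasks]

    @staticmethod
    def evaluate(trajectories: list[Trajectory]) -> float:
        # Benchmark evaluation: success rate over the collected rollouts.
        if not trajectories:
            return 0.0
        return sum(t.success for t in trajectories) / len(trajectories)

    def filter_and_store(self, trajectories: list[Trajectory]) -> int:
        # Data optimization: keep only successful rollouts as candidate training data.
        kept = [t for t in trajectories if t.success]
        self.dataset.extend(kept)
        return len(kept)

    def run_round(self, per_asset: int, success_prob: float) -> tuple[float, int]:
        trajectories = self.collect(self.generate_tasks(per_asset), success_prob)
        return self.evaluate(trajectories), self.filter_and_store(trajectories)


pipeline = ClosedLoopPipeline(assets=["kitchen", "tabletop"], seed=0)
success_rate, kept = pipeline.run_round(per_asset=5, success_prob=0.5)
```

In a cloud-native deployment each stage would presumably run as its own service or container pool; the in-process version only shows the data flow between them.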

If this is right

  • Enables efficient large-scale simulation for multi-model and multi-task workloads.
  • Supports standardized evaluation and real-time data filtering through integrated systems.
  • Facilitates closed-loop data optimization for continuous improvement.
  • Provides a foundation for real-world deployment of embodied intelligence models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could enable researchers without access to physical robots to conduct large-scale experiments.
  • It might accelerate the development of embodied AI by making simulation data more accessible and consistent.
  • Future extensions could include direct transfer learning from simulation to real hardware.
  • Scalability claims could be tested by running thousands of parallel simulations and measuring resource utilization.
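The last of these tests could be scored with two standard ratios, speedup relative to the smallest configuration and parallel efficiency (speedup divided by the ideal linear speedup), plus mean worker utilization. The sketch assumes throughput (for example, completed episodes per second) and per-worker busy time are measured externally; the function names are hypothetical.

```python
def scaling_report(throughput_by_workers: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map each worker count to (speedup, parallel efficiency) relative to the
    smallest measured configuration; efficiency 1.0 means linear scaling."""
    if not throughput_by_workers:
        raise ValueError("no measurements")
    base_n = min(throughput_by_workers)
    base_tp = throughput_by_workers[base_n]
    report = {}
    for n in sorted(throughput_by_workers):
        speedup = throughput_by_workers[n] / base_tp
        ideal_speedup = n / base_n
        report[n] = (speedup, speedup / ideal_speedup)
    return report


def mean_utilization(busy_seconds_per_worker: list[float], wall_seconds: float) -> float:
    """Fraction of provisioned worker time actually spent simulating (0 to 1)."""
    return sum(busy_seconds_per_worker) / (len(busy_seconds_per_worker) * wall_seconds)
```

Efficiency that drops well below 1.0 as workers are added could point to a bottleneck outside the simulators themselves, such as scheduling overhead or shared storage.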

Load-bearing premise

That cloud-native technologies such as elastic scheduling and containerization will substantially reduce the cost, improve scalability, and enhance reproducibility compared to traditional robotic data collection methods.
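Containerization pins the software environment, but repeated rollouts also need identical random seeds regardless of which worker runs an episode or in what order. One common approach, sketched here under that assumption, derives each episode's seed from a cryptographic hash of the experiment's identity. Python's built-in `hash()` is unsuitable for this because string hashing is salted per process unless `PYTHONHASHSEED` is fixed. The helper names and the scene-sampling step are hypothetical, and the sketch does not address nondeterminism inside physics engines or GPU kernels.

```python
import hashlib
import random


def episode_seed(base_seed: int, task_id: str, episode_index: int) -> int:
    """Derive a stable 64-bit seed from the experiment's identity, not from
    worker ID, wall-clock time, or scheduling order."""
    key = f"{base_seed}:{task_id}:{episode_index}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def sample_initial_state(seed: int, num_objects: int = 3) -> list[float]:
    # Hypothetical randomized scene setup, e.g. object x-positions in metres.
    rng = random.Random(seed)
    return [round(rng.uniform(-0.3, 0.3), 4) for _ in range(num_objects)]


# Two workers processing the same episodes in different orders produce identical scenes.
episodes = [("pick_cube", i) for i in range(4)]
run_a = {ep: sample_initial_state(episode_seed(42, *ep)) for ep in episodes}
run_b = {ep: sample_initial_state(episode_seed(42, *ep)) for ep in reversed(episodes)}
assert run_a == run_b
```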

What would settle it

A side-by-side experiment measuring total cost, number of successful trajectories per unit time, and variance in results between this framework and a non-cloud-native simulation setup.
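The three quantities in that proposed experiment reduce to simple arithmetic over repeated runs of each setup. A minimal sketch, assuming each run is logged as a record with cost, wall-clock time, successful-trajectory count, and an evaluation success rate (a hypothetical schema, not one the paper defines): cost per successful trajectory is total cost divided by total successful trajectories, throughput is successful trajectories per wall-clock hour, and run-to-run variability is the sample variance of the success rate across repeats.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One repeated run of a setup (hypothetical schema)."""
    cost_usd: float
    wall_clock_hours: float
    successful_trajectories: int
    eval_success_rate: float


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    if len(runs) < 2:
        raise ValueError("need at least two repeated runs to estimate variance")
    total_cost = sum(r.cost_usd for r in runs)
    total_hours = sum(r.wall_clock_hours for r in runs)  # assumes runs executed back to back
    total_ok = sum(r.successful_trajectories for r in runs)
    return {
        "total_cost_usd": total_cost,
        "trajectories_per_hour": total_ok / total_hours,
        "cost_per_trajectory_usd": total_cost / total_ok if total_ok else float("inf"),
        # Sample variance across repeats; lower is one proxy for better reproducibility.
        "success_rate_variance": statistics.variance([r.eval_success_rate for r in runs]),
    }


def side_by_side(candidate: list[RunRecord],
                 baseline: list[RunRecord]) -> dict[str, tuple[float, float]]:
    """Pair each metric as (cloud-native candidate, non-cloud-native baseline)."""
    cand, base = summarize(candidate), summarize(baseline)
    return {metric: (cand[metric], base[metric]) for metric in cand}
```

Summing wall-clock hours assumes the repeats ran one after another; for concurrent repeats, the elapsed time of the whole batch would be the appropriate denominator.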

Original abstract

This paper presents a cloud-native simulation infrastructure framework for embodied intelligence that supports large-scale training, standardized evaluation, and simulation-based data collection. The framework unifies simulation environment generation, task execution, trajectory collection, model evaluation, data management, and cloud services into a scalable and reproducible platform. To address the high cost, limited scalability, and poor reproducibility of real-world robotic data collection, the framework adopts cloud-native technologies including elastic resource scheduling, containerized simulation, unified data management, and service-oriented system design, enabling efficient large-scale simulation for multi-model and multi-task workloads. Built on a four-layer architecture, the framework provides standardized environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization. It further integrates representative systems including D-VLA, RL-VLA3, Sword, and Pre-VLA to support scalable simulation, dynamic scheduling, visual augmentation, and real-time data filtering. We argue that cloud-native simulation infrastructure provides a unified foundation for data generation, model training, standardized evaluation, and real-world deployment, and will play a key role in the future development of embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a cloud-native simulation infrastructure framework for embodied intelligence that unifies environment asset management, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization. It adopts elastic scheduling, containerization, and unified data management in a four-layer architecture, integrates with D-VLA, RL-VLA3, Sword, and Pre-VLA, and argues that this design addresses the high cost, limited scalability, and poor reproducibility of real-world robotic data collection while providing a foundation for training, evaluation, and deployment.

Significance. If implemented and quantitatively validated, the proposed infrastructure could offer a standardized, reproducible platform that lowers barriers to large-scale embodied AI experimentation. The manuscript supplies only a high-level systems description with no scaling curves, throughput numbers, cost comparisons, or reproducibility metrics, so its significance is currently prospective rather than demonstrated.

major comments (2)
  1. [Abstract] Abstract: the claim that cloud-native technologies 'enable efficient large-scale simulation for multi-model and multi-task workloads' and solve high cost, limited scalability, and poor reproducibility is presented without any supporting measurements, scaling experiments, or comparisons against non-cloud baselines.
  2. [Four-layer architecture description] Four-layer architecture and integration sections: benefits of the environment-assets / task-generation / trajectory-collection / benchmark-evaluation layers and the cited integrations (D-VLA, RL-VLA3, Sword, Pre-VLA) are asserted as design-enabled outcomes, yet no ablation studies, throughput figures, trajectory-variance statistics, or baseline comparisons are reported.
minor comments (1)
  1. The manuscript would benefit from explicit definitions or references for the concrete container orchestration and data-management protocols employed in the unified data-management layer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive criticism. The manuscript is intended as a systems paper describing a cloud-native simulation infrastructure framework. We agree with the observation that it lacks quantitative evaluations and will revise the text to more accurately reflect the scope and nature of the contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that cloud-native technologies 'enable efficient large-scale simulation for multi-model and multi-task workloads' and solve high cost, limited scalability, and poor reproducibility is presented without any supporting measurements, scaling experiments, or comparisons against non-cloud baselines.

    Authors: We accept this point. The abstract overstates the demonstrated benefits. In the revised manuscript, we will rephrase the abstract to present these as intended outcomes of the design rather than proven results, and we will add a discussion on the rationale behind the design choices that are expected to address these issues. revision: yes

  2. Referee: [Four-layer architecture description] Four-layer architecture and integration sections: benefits of the environment-assets / task-generation / trajectory-collection / benchmark-evaluation layers and the cited integrations (D-VLA, RL-VLA3, Sword, Pre-VLA) are asserted as design-enabled outcomes, yet no ablation studies, throughput figures, trajectory-variance statistics, or baseline comparisons are reported.

    Authors: The four-layer architecture is presented as a proposed structure to achieve the goals of scalability and reproducibility. The integrations are examples of systems that can leverage this infrastructure. We will revise these sections to clarify that the benefits are hypothesized based on the architecture and that no empirical studies are included in this work, as the paper focuses on the infrastructure foundation rather than specific performance metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: systems architecture paper with no derivations or predictions

Full rationale

The manuscript is a descriptive systems paper proposing a four-layer cloud-native simulation framework. It contains no equations, no fitted parameters, no predictions of quantities, and no derivation chains. Claims about scalability and reproducibility are presented as enabled by the adopted technologies (elastic scheduling, containerization) rather than derived from any inputs or self-citations. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The central assertions reduce to design choices, not to any tautological reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced because the paper is a high-level systems description without mathematical content or new postulated objects.

pith-pipeline@v0.9.1-grok · 5768 in / 1020 out tokens · 26288 ms · 2026-06-29T04:23:34.117132+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 14 linked inside Pith

  1. [1]

    RT-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashniko...

  2. [2]

    Ryoo, Grecia Salazar, Pannag R

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deepak Manjunath, Igor Mordatch...

  3. [3]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  4. [4]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Teodor Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  5. [5]

    World models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

  6. [6]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020

  7. [7]

    RL-VLA3: A flexible and asynchronous reinforcement learning framework for VLA training

    Haoran Sun, Yongjian Guo, Zhong Guan, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Hongke Zhao, Likang Wu, Xiaotie Deng, Xu Chu, Xi Xiao, Sheng Wen, Yicheng Gong, and Junwu Xiong. RL-VLA3: A flexible and asynchronous reinforcement learning framework for VLA training. arXiv preprint arXiv:2602.05765, 2026

  8. [8]

    D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models

    Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, and Yicheng Gong. D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models. arXiv preprint arXiv:2605.13276, 2026

  9. [9]

    Introduction to the art and science of simulation

    Robert E. Shannon. Introduction to the art and science of simulation. In Proceedings of the 30th Conference on Winter Simulation, pages 7–14, 1998

  10. [10]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

  11. [11]

    CAD2RL: Real single-image flight without a single real image

    Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems, 2017

  12. [12]

    Sim-to-real transfer of robotic control with dynamics randomization

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

  13. [13]

    Isaac Gym: High performance GPU-based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  14. [14]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

  15. [15]

    SAPIEN: A simulated part-based interactive environment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

  16. [16]

    RLBench: The robot learning benchmark and learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark and learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  17. [17]

    ManiSkill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. ManiSkill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023

  18. [18]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, 2024

  19. [19]

    RoboCasa: Large-scale simulation of everyday tasks for generalist robots

    Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems, 2024

  20. [20]

    AI2-THOR: An interactive 3d environment for visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

  21. [21]

    Habitat 2.0

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Trainin...

  22. [22]

    BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments

    Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, 2022

  23. [23]

    CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. In IEEE Robotics and Automation Letters, 2022

  24. [24]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation, 2024

  25. [25]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter Luo, Fan Qian, Ethan Zhu, Dibya Gandhi, Bradly Stadie, Austin Stone, Michael Chiang, Fei Xia, Chelsea Finn, and Sergey Levine. DROID: A large-scale in-the-wild robot man...

  26. [26]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. Conference on Robot Learning Workshop, 2023

  27. [27]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019

  28. [28]

    Sword: Style-robust world models as simulators via dynamic latent bootstrapping for VLA policy post-training

    Jiaxuan Gao, Yongjian Guo, Zhong Guan, Wen Huang, Wanlun Ma, Xi Xiao, Junwu Xiong, and Sheng Wen. Sword: Style-robust world models as simulators via dynamic latent bootstrapping for VLA policy post-training. arXiv preprint arXiv:2605.07288, 2026

  29. [29]

    Pre-VLA: Preemptive runtime verification for reliable vision-language-action and world-model rollouts

    Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, and Zhijun Meng. Pre-VLA: Preemptive runtime verification for reliable vision-language-action and world-model rollouts. arXiv preprint arXiv:2605.22446, 2026

  30. [30]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023

  31. [31]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023

  32. [32]

    Design and use paradigms for Gazebo, an open-source multi-robot simulator

    Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154, 2004

  33. [33]

    ROS: An open-source robot operating system

    Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: An open-source robot operating system. In ICRA Workshop on Open Source Software, 2009

  34. [34]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  35. [35]

    Deepmind control suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  36. [36]

    robosuite: A modular simulation framework and benchmark for robot learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martin-Martin, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

  37. [37]

    PyBullet, a Python module for physics simulation for games, robotics and machine learning

    Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016

  38. [38]

    Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100, 2020

  39. [39]

    Gibson Env: Real-world perception for embodied agents

    Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018

  40. [40]

    iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes

    Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne Tchapmi, Kent Vainio, James Wong, Li Fei-Fei, and Silvio Savarese. iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In IEEE/RSJ International Conference on Intelligent Robots and S...

  41. [41]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel X. Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision, pages 667–676, 2017

  42. [42]

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Stra...

  43. [43]

    ProcTHOR: Large-scale embodied AI using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In Advances in Neural Information Processing Systems, 2022

  44. [44]

    Habitat: A platform for embodied AI research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

  45. [45]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

  46. [46]

    ALFRED: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020

  47. [47]

    TEACh: Task-driven embodied agents that chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat. In AAAI Conference on Artificial Intelligence, pages 2017–2025, 2022

  48. [48]

    Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

    Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems, 2017

  49. [49]

    End-to-end training of deep visuomotor policies

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. In Journal of Machine Learning Research, volume 17, pages 1–40, 2016

  50. [50]

    QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018

  51. [51]

    RoboNet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897, 2019

  52. [52]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 2021

  53. [53]

    Mastering diverse domains through world models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  54. [54]

    Mastering atari, go, chess and shogi by planning with a learned model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020

  55. [55]

    A generalist agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on...

  56. [56]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang...

  57. [57]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodie...

  58. [58]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tianhe Yu Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, 2022

  59. [59]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation, pages 9493–9500, 2023