Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence Cloud-Native Simulation Infrastructure for Embodied Intelligence Training, Evaluation, and Data Collection

Junwu Xiong; Lei Kang; Mingxi Luo; Ning Qiao; Song Wang; Yince Gao; Yongjian Guo

arxiv: 2606.27962 · v1 · pith:ADJDY57Jnew · submitted 2026-06-26 · 💻 cs.RO

Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence Cloud-Native Simulation Infrastructure for Embodied Intelligence Training, Evaluation, and Data Collection

Junwu Xiong , Yongjian Guo , Mingxi Luo , Ning Qiao , Lei Kang , Song Wang , Yince Gao This is my paper

Pith reviewed 2026-06-29 04:23 UTC · model grok-4.3

classification 💻 cs.RO

keywords cloud-nativesimulation infrastructureembodied intelligenceroboticsdata collectionmodel evaluationscalable platformclosed-loop

0 comments

The pith

A cloud-native simulation infrastructure unifies data generation, training, evaluation, and deployment for embodied intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that uses cloud-native technologies to create a scalable and reproducible simulation environment for embodied intelligence. It tackles the problems of high cost and poor reproducibility in real-world robotic data collection by employing elastic scheduling, containerization, and unified data management. The system features a four-layer architecture that automates environment generation, task execution, trajectory collection, and closed-loop optimization. This setup supports large-scale multi-model and multi-task workloads while integrating specific systems for visual augmentation and data filtering.

Core claim

The authors claim that cloud-native simulation infrastructure, through its four-layer architecture and adoption of elastic resource scheduling, containerized simulation, unified data management, and service-oriented design, provides a unified foundation for simulation environment generation, task execution, trajectory collection, model evaluation, data management, and cloud services in embodied intelligence.

What carries the argument

The four-layer cloud-native simulation infrastructure architecture that unifies environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization.

If this is right

Enables efficient large-scale simulation for multi-model and multi-task workloads.
Supports standardized evaluation and real-time data filtering through integrated systems.
Facilitates closed-loop data optimization for continuous improvement.
Provides a foundation for real-world deployment of embodied intelligence models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could enable researchers without access to physical robots to conduct large-scale experiments.
It might accelerate the development of embodied AI by making simulation data more accessible and consistent.
Future extensions could include direct transfer learning from simulation to real hardware.
Scalability claims could be tested by running thousands of parallel simulations and measuring resource utilization.

Load-bearing premise

That cloud-native technologies such as elastic scheduling and containerization will substantially reduce the cost, improve scalability, and enhance reproducibility compared to traditional robotic data collection methods.

What would settle it

A side-by-side experiment measuring total cost, number of successful trajectories per unit time, and variance in results between this framework and a non-cloud-native simulation setup.

read the original abstract

This paper presents a cloud-native simulation infrastructure framework for embodied intelligence that supports large-scale training, standardized evaluation, and simulation-based data collection. The framework unifies simulation environment generation, task execution, trajectory collection, model evaluation, data management, and cloud services into a scalable and reproducible platform. To address the high cost, limited scalability, and poor reproducibility of real-world robotic data collection, the framework adopts cloud-native technologies including elastic resource scheduling, containerized simulation, unified data management, and service-oriented system design, enabling efficient large-scale simulation for multi-model and multi-task workloads. Built on a four-layer architecture, the framework provides standardized environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization. It further integrates representative systems including D-VLA, RL-VLA3, Sword, and Pre-VLA to support scalable simulation, dynamic scheduling, visual augmentation, and real-time data filtering. We argue that cloud-native simulation infrastructure provides a unified foundation for data generation, model training, standardized evaluation, and real-world deployment, and will play a key role in the future development of embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a framework proposal for cloud-native embodied AI simulation with no performance data or comparisons to back its claims.

read the letter

The paper proposes a cloud-native simulation infrastructure for embodied intelligence but presents it as a framework without any supporting measurements or comparisons.

What stands out is the four-layer design that covers standardized environment assets, automated task generation, trajectory collection, benchmark evaluation, and closed-loop optimization. The authors also describe how it ties into D-VLA, RL-VLA3, Sword, and Pre-VLA for things like dynamic scheduling and data filtering. This gives a concrete picture of how the pieces could fit together in a cloud setting.

The approach draws on familiar cloud practices like containerization and elastic resource scheduling to tackle cost, scale, and reproducibility in robotic data collection. That synthesis is useful as a reference.

However, the paper makes strong claims about solving those problems through the design alone. There are no reported results on throughput, cost savings, reproducibility across runs, or performance against existing simulators. The benefits are stated as following from the architecture rather than demonstrated.

The stress-test note is on point: the central argument lacks quantitative backing.

This paper would mainly interest engineers and researchers who are already setting up large simulation environments for embodied AI and want ideas on cloud integration. A reader seeking new methods or validated improvements will not find them here.

I would not cite it or bring it to a reading group. It does not merit peer review in this form because the lack of evidence makes it hard to assess whether the framework delivers on its promises.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a cloud-native simulation infrastructure framework for embodied intelligence that unifies environment asset management, automated task generation, trajectory collection, benchmark evaluation, and closed-loop data optimization. It adopts elastic scheduling, containerization, and unified data management in a four-layer architecture, integrates with D-VLA, RL-VLA3, Sword, and Pre-VLA, and argues that this design addresses the high cost, limited scalability, and poor reproducibility of real-world robotic data collection while providing a foundation for training, evaluation, and deployment.

Significance. If implemented and quantitatively validated, the proposed infrastructure could offer a standardized, reproducible platform that lowers barriers to large-scale embodied AI experimentation. The manuscript supplies only a high-level systems description with no scaling curves, throughput numbers, cost comparisons, or reproducibility metrics, so its significance is currently prospective rather than demonstrated.

major comments (2)

[Abstract] Abstract: the claim that cloud-native technologies 'enable efficient large-scale simulation for multi-model and multi-task workloads' and solve high cost, limited scalability, and poor reproducibility is presented without any supporting measurements, scaling experiments, or comparisons against non-cloud baselines.
[Four-layer architecture description] Four-layer architecture and integration sections: benefits of the environment-assets / task-generation / trajectory-collection / benchmark-evaluation layers and the cited integrations (D-VLA, RL-VLA3, Sword, Pre-VLA) are asserted as design-enabled outcomes, yet no ablation studies, throughput figures, trajectory-variance statistics, or baseline comparisons are reported.

minor comments (1)

The manuscript would benefit from explicit definitions or references for the concrete container orchestration and data-management protocols employed in the unified data-management layer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive criticism. The manuscript is intended as a systems paper describing a cloud-native simulation infrastructure framework. We agree with the observation that it lacks quantitative evaluations and will revise the text to more accurately reflect the scope and nature of the contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that cloud-native technologies 'enable efficient large-scale simulation for multi-model and multi-task workloads' and solve high cost, limited scalability, and poor reproducibility is presented without any supporting measurements, scaling experiments, or comparisons against non-cloud baselines.

Authors: We accept this point. The abstract overstates the demonstrated benefits. In the revised manuscript, we will rephrase the abstract to present these as intended outcomes of the design rather than proven results, and we will add a discussion on the rationale behind the design choices that are expected to address these issues. revision: yes
Referee: [Four-layer architecture description] Four-layer architecture and integration sections: benefits of the environment-assets / task-generation / trajectory-collection / benchmark-evaluation layers and the cited integrations (D-VLA, RL-VLA3, Sword, Pre-VLA) are asserted as design-enabled outcomes, yet no ablation studies, throughput figures, trajectory-variance statistics, or baseline comparisons are reported.

Authors: The four-layer architecture is presented as a proposed structure to achieve the goals of scalability and reproducibility. The integrations are examples of systems that can leverage this infrastructure. We will revise these sections to clarify that the benefits are hypothesized based on the architecture and that no empirical studies are included in this work, as the paper focuses on the infrastructure foundation rather than specific performance metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: systems architecture paper with no derivations or predictions

full rationale

The manuscript is a descriptive systems paper proposing a four-layer cloud-native simulation framework. It contains no equations, no fitted parameters, no predictions of quantities, and no derivation chains. Claims about scalability and reproducibility are presented as enabled by the adopted technologies (elastic scheduling, containerization) rather than derived from any inputs or self-citations. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The central assertions reduce to design choices, not to any tautological reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced because the paper is a high-level systems description without mathematical content or new postulated objects.

pith-pipeline@v0.9.1-grok · 5768 in / 1020 out tokens · 26288 ms · 2026-06-29T04:23:34.117132+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 14 linked inside Pith

[1]

RT-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, A vinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashniko...

2023
[2]

Ryoo, Grecia Salazar, Pannag R

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deepak Manjunath, Igor Mordatch...

2023
[3]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[4]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Teodor Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024
[5]

World models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

Pith/arXiv arXiv 2018
[6]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020

2020
[7]

RL-VLA3: A flexible and asynchronous reinforcement learning framework for vla training

Haoran Sun, Yongjian Guo, Zhong Guan, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Hongke Zhao, Likang Wu, Xiaotie Deng, Xu Chu, Xi Xiao, Sheng Wen, Yicheng Gong, and Junwu Xiong. RL-VLA3: A flexible and asynchronous reinforcement learning framework for vla training. arXiv preprint arXiv:2602.05765, 2026

Pith/arXiv arXiv 2026
[8]

D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models

Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, and Yicheng Gong. D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models. arXiv preprint arXiv:2605.13276, 2026

Pith/arXiv arXiv 2026
[9]

Robert E. Shannon. Introduction to the art and science of simulation. In Proceedings of the 30th Conference on Winter Simulation, pages 7–14, 1998

1998
[10]

Domain random- ization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain random- ization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

2017
[11]

CAD2RL: Real single-image flight without a single real image

Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems, 2017

2017
[12]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

2018
[13]

Isaac Gym: High performance GPU-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021. 27

Pith/arXiv arXiv 2021
[14]

MuJoCo: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

2012
[15]

Chang, Leonidas J

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part- based interactive environment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

2020
[16]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark and learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

2020
[17]

ManiSkill2: A unified benchmark for generalizable manipulation skills

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. ManiSkill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023

2023
[18]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, 2024

2024
[19]

RoboCasa: Large-scale simulation of everyday tasks for generalist robots

Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems, 2024

2024
[20]

AI2-THOR: An interactive 3d environment for visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

Pith/arXiv arXiv 2017
[21]

Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra

Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Trainin...

2021
[22]

BEHA VIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments

Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. BEHA VIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, 2022

2022
[23]

CAL VIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CAL VIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. In IEEE Robotics and Automation Letters, 2022

2022
[24]

Open X-Embodiment: Robotic learning datasets and RT-X models

Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation, 2024

2024
[25]

DROID: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter Luo, Fan Qian, Ethan Zhu, Dibya Gandhi, Bradly Stadie, Austin Stone, Michael Chiang, Fei Xia, Chelsea Finn, and Sergey Levine. DROID: A large-scale in-the-wild robot man...

2024
[26]

Bridgedata v2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. Conference on Robot Learning Workshop, 2023

2023
[27]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019

2019
[28]

Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training

Jiaxuan Gao, Yongjian Guo, Zhong Guan, Wen Huang, Wanlun Ma, Xi Xiao, Junwu Xiong, and Sheng Wen. Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training. arXiv preprint arXiv:2605.07288, 2026

Pith/arXiv arXiv 2026
[29]

Pre-vla: Preemptive runtime verification for reliable vision-language-action and world-model rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, and Zhijun Meng. Pre-vla: Preemptive runtime verification for reliable vision-language-action and world-model rollouts. arXiv preprint arXiv:2605.22446, 2026

Pith/arXiv arXiv 2026
[30]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023. 28

2023
[31]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023

2023
[32]

Design and use paradigms for Gazebo, an open-source multi-robot simulator

Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154, 2004

2004
[33]

Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: An open-source robot operating system. In ICRA Workshop on Open Source Software, 2009

2009
[34]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

Pith/arXiv arXiv 2016
[35]

Deepmind control suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

Pith/arXiv arXiv 2018
[36]

robosuite: A modular simulation framework and benchmark for robot learning

Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martin-Martin, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

Pith/arXiv arXiv 2009
[37]

PyBullet, a Python module for physics simulation for games, robotics and machine learning

Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016

2016
[38]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100, 2020

2020
[39]

Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real- world perception for embodied agents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018

2018
[40]

iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes

Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne Tchapmi, Kent Vainio, James Wong, Li Fei-Fei, and Silvio Savarese. iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In IEEE/RSJ International Conference on Intelligent Robots and S...

2021
[41]

Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang

Angel X. Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision, pages 667–676, 2017

2017
[42]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Stra...

Pith/arXiv arXiv 1906
[43]

ProcTHOR: Large-scale embodied AI using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In Advances in Neural Information Processing Systems, 2022

2022
[44]

Habitat: A platform for embodied AI research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

2019
[45]

Vision-and-language navigation: Interpreting visually-grounded navigation instruc- tions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instruc- tions in real environments. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018
[46]

ALFRED: A benchmark for interpreting grounded instructions for everyday tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettle- moyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020. 29

2020
[47]

TEACh: Task-driven embodied agents that chat

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat. In AAAI Conference on Artificial Intelligence, pages 2017–2025, 2022

2017
[48]

Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems, 2017

2017
[49]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. In Journal of Machine Learning Research, volume 17, pages 1–40, 2016

2016
[50]

QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018

2018
[51]

RoboNet: Large-scale multi-robot learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897, 2019

2019
[52]

What matters in learning from offline human demonstrations for robot manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 2021

2021
[53]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023
[54]

Mas- tering atari, go, chess and shogi by planning with a learned model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mas- tering atari, go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020

2020
[55]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth- Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on...

2022
[56]

Do as i can, not as i say: Grounding language in robotic affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang...

2022
[57]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodie...

2023
[58]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tianhe Yu Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, 2022

2022
[59]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation, pages 9493–9500, 2023. 30

2023

[1] [1]

RT-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, A vinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashniko...

2023

[2] [2]

Ryoo, Grecia Salazar, Pannag R

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deepak Manjunath, Igor Mordatch...

2023

[3] [3]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[4] [4]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Teodor Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024

[5] [5]

World models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

Pith/arXiv arXiv 2018

[6] [6]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020

2020

[7] [7]

RL-VLA3: A flexible and asynchronous reinforcement learning framework for vla training

Haoran Sun, Yongjian Guo, Zhong Guan, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Hongke Zhao, Likang Wu, Xiaotie Deng, Xu Chu, Xi Xiao, Sheng Wen, Yicheng Gong, and Junwu Xiong. RL-VLA3: A flexible and asynchronous reinforcement learning framework for vla training. arXiv preprint arXiv:2602.05765, 2026

Pith/arXiv arXiv 2026

[8] [8]

D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models

Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, and Yicheng Gong. D-VLA: A high-concurrency distributed asynchronous reinforcement learning framework for vision-language-action models. arXiv preprint arXiv:2605.13276, 2026

Pith/arXiv arXiv 2026

[9] [9]

Robert E. Shannon. Introduction to the art and science of simulation. In Proceedings of the 30th Conference on Winter Simulation, pages 7–14, 1998

1998

[10] [10]

Domain random- ization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain random- ization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30, 2017

2017

[11] [11]

CAD2RL: Real single-image flight without a single real image

Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In Robotics: Science and Systems, 2017

2017

[12] [12]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation, pages 3803–3810, 2018

2018

[13] [13]

Isaac Gym: High performance GPU-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021. 27

Pith/arXiv arXiv 2021

[14] [14]

MuJoCo: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012

2012

[15] [15]

Chang, Leonidas J

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part- based interactive environment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

2020

[16] [16]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark and learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

2020

[17] [17]

ManiSkill2: A unified benchmark for generalizable manipulation skills

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. ManiSkill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023

2023

[18] [18]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems, 2024

2024

[19] [19]

RoboCasa: Large-scale simulation of everyday tasks for generalist robots

Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems, 2024

2024

[20] [20]

AI2-THOR: An interactive 3d environment for visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

Pith/arXiv arXiv 2017

[21] [21]

Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra

Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Trainin...

2021

[22] [22]

BEHA VIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments

Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. BEHA VIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, 2022

2022

[23] [23]

CAL VIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CAL VIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. In IEEE Robotics and Automation Letters, 2022

2022

[24] [24]

Open X-Embodiment: Robotic learning datasets and RT-X models

Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation, 2024

2024

[25] [25]

DROID: A large-scale in-the-wild robot manipulation dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter Luo, Fan Qian, Ethan Zhu, Dibya Gandhi, Bradly Stadie, Austin Stone, Michael Chiang, Fei Xia, Chelsea Finn, and Sergey Levine. DROID: A large-scale in-the-wild robot man...

2024

[26] [26]

Bridgedata v2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. Conference on Robot Learning Workshop, 2023

2023

[27] [27]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019

2019

[28] [28]

Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training

Jiaxuan Gao, Yongjian Guo, Zhong Guan, Wen Huang, Wanlun Ma, Xi Xiao, Junwu Xiong, and Sheng Wen. Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training. arXiv preprint arXiv:2605.07288, 2026

Pith/arXiv arXiv 2026

[29] [29]

Pre-vla: Preemptive runtime verification for reliable vision-language-action and world-model rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, and Zhijun Meng. Pre-vla: Preemptive runtime verification for reliable vision-language-action and world-model rollouts. arXiv preprint arXiv:2605.22446, 2026

Pith/arXiv arXiv 2026

[30] [30]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023. 28

2023

[31] [31]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023

2023

[32] [32]

Design and use paradigms for Gazebo, an open-source multi-robot simulator

Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154, 2004

2004

[33] [33]

Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. ROS: An open-source robot operating system. In ICRA Workshop on Open Source Software, 2009

2009

[34] [34]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

Pith/arXiv arXiv 2016

[35] [35]

Deepmind control suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

Pith/arXiv arXiv 2018

[36] [36]

robosuite: A modular simulation framework and benchmark for robot learning

Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martin-Martin, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

Pith/arXiv arXiv 2009

[37] [37]

PyBullet, a Python module for physics simulation for games, robotics and machine learning

Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016

2016

[38] [38]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100, 2020

2020

[39] [39]

Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real- world perception for embodied agents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018

2018

[40] [40]

iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes

Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne Tchapmi, Kent Vainio, James Wong, Li Fei-Fei, and Silvio Savarese. iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In IEEE/RSJ International Conference on Intelligent Robots and S...

2021

[41] [41]

Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang

Angel X. Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision, pages 667–676, 2017

2017

[42] [42]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Ming Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Stra...

Pith/arXiv arXiv 1906

[43] [43]

ProcTHOR: Large-scale embodied AI using procedural generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In Advances in Neural Information Processing Systems, 2022

2022

[44] [44]

Habitat: A platform for embodied AI research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

2019

[45] [45]

Vision-and-language navigation: Interpreting visually-grounded navigation instruc- tions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instruc- tions in real environments. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018

2018

[46] [46]

ALFRED: A benchmark for interpreting grounded instructions for everyday tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettle- moyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020. 29

2020

[47] [47]

TEACh: Task-driven embodied agents that chat

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat. In AAAI Conference on Artificial Intelligence, pages 2017–2025, 2022

2017

[48] [48]

Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems, 2017

2017

[49] [49]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. In Journal of Machine Learning Research, volume 17, pages 1–40, 2016

2016

[50] [50]

QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673, 2018

2018

[51] [51]

RoboNet: Large-scale multi-robot learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897, 2019

2019

[52] [52]

What matters in learning from offline human demonstrations for robot manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning, 2021

2021

[53] [53]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023

[54] [54]

Mas- tering atari, go, chess and shogi by planning with a learned model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mas- tering atari, go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020

2020

[55] [55]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth- Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Transactions on...

2022

[56] [56]

Do as i can, not as i say: Grounding language in robotic affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang...

2022

[57] [57]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodie...

2023

[58] [58]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tianhe Yu Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning, 2022

2022

[59] [59]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation, pages 9493–9500, 2023. 30

2023