pith. machine review for the scientific record.

arxiv: 2403.09227 · v1 · submitted 2024-03-14 · 💻 cs.RO · cs.AI

Recognition: 1 theorem link

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied AI · robot learning · simulation benchmark · human-centered robotics · everyday activities · manipulation skills · long-horizon tasks

The pith

The BEHAVIOR-1K benchmark defines 1,000 survey-grounded everyday activities in realistic physics simulation to test embodied AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEHAVIOR-1K as a benchmark built from an extensive survey asking people what tasks they want robots to perform for them. It specifies 1,000 activities across 50 scenes containing more than 9,000 objects, all supported by the OMNIGIBSON simulator that models rigid bodies, deformable materials, and liquids with realistic physics and rendering. Experiments demonstrate that these tasks require long sequences of actions and sophisticated manipulation skills that exceed the capabilities of current state-of-the-art robot learning methods. The work also reports an early calibration study transferring a mobile manipulator policy from the simulated apartment to its physical counterpart.

Core claim

The central claim is that BEHAVIOR-1K supplies a human-grounded collection of 1,000 everyday activities together with the OMNIGIBSON environment that renders the necessary physical interactions, and that these activities expose the inability of existing robot learning approaches to handle extended horizons and complex manipulations.

What carries the argument

The BEHAVIOR-1K benchmark pairs survey-derived activity specifications with the OMNIGIBSON simulator that enables realistic rigid-body, deformable-body, and liquid interactions across diverse indoor and outdoor scenes.
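A minimal sketch of what one of these activity specifications amounts to, assuming a pairing of scene, objects, and symbolic initial/goal conditions in the spirit of the benchmark's definitions; the field names and predicates below are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of an activity specification: a scene, the objects
# involved, and symbolic initial/goal conditions. Field names and predicate
# forms are illustrative, not BEHAVIOR-1K's real definition language.
@dataclass
class ActivitySpec:
    name: str
    scene: str            # one of the benchmark's 50 scenes, e.g. a house
    objects: list[str]    # object categories drawn from the object library
    initial: list[tuple]  # predicates that must hold at reset
    goal: list[tuple]     # predicates that define success

    def is_satisfied(self, state: set[tuple]) -> bool:
        """Success check: every goal predicate holds in the current state."""
        return all(pred in state for pred in self.goal)

clean_table = ActivitySpec(
    name="clean_the_table",
    scene="house_01",
    objects=["table", "sponge", "plate"],
    initial=[("dusty", "table"), ("on_top", "plate", "table")],
    goal=[("not_dusty", "table"), ("in", "plate", "cabinet")],
)
```

The point of the sketch is that success is a symbolic condition over simulator state, which is what lets 1,000 heterogeneous activities share one evaluation harness.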

If this is right

  • State-of-the-art robot learning methods cannot yet complete the long-horizon activities that require complex manipulation.
  • An initial study transfers a mobile-manipulator policy from simulation to a real apartment and measures the resulting performance gap.
  • The benchmark supplies a diverse, human-centered testbed intended to drive progress in embodied AI and robot learning research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithms that incorporate explicit long-horizon planning or hierarchical decomposition may be needed to make progress on these tasks.
  • Prioritizing development around the most frequently requested activity categories from the survey could focus research effort on high-impact capabilities.
  • Reducing the documented sim-to-real gap through improved physics modeling would allow more reliable deployment of policies trained in the benchmark.
  • Adding controlled variations in object properties or scene layouts would provide a direct test of generalization beyond the fixed 50 scenes.
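The hierarchical-decomposition suggestion can be pictured as splitting a long-horizon goal into ordered subgoals that a low-level policy attacks one at a time; the decomposition rule below is a toy assumption for illustration, not a method from the paper.

```python
# Toy sketch of hierarchical decomposition for a long-horizon activity:
# each goal predicate becomes a navigate/manipulate subgoal pair that a
# low-level policy could be trained on separately. Purely illustrative.
def decompose(goal_predicates):
    subgoals = []
    for pred in goal_predicates:
        target = pred[1]                    # object the predicate mentions
        subgoals.append(("navigate_to", target))
        subgoals.append(("achieve", pred))  # manipulation subgoal
    return subgoals

plan = decompose([("not_dusty", "table"), ("in", "plate", "cabinet")])
# yields four ordered subgoals, two per goal predicate
```

Even this crude split shows why horizon length compounds difficulty: every additional goal predicate multiplies the number of low-level skills that must all succeed in sequence.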

Load-bearing premise

The 1,000 activities drawn from the human survey accurately reflect tasks people want robots to perform and the OMNIGIBSON simulation is realistic enough to support meaningful transfer to physical robots.

What would settle it

A new survey of comparable size showing substantially different priority activities, or a learned policy achieving high success rates in simulation but failing to transfer to the matching real apartment, would undermine the benchmark's premises.

read the original abstract

We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BEHAVIOR-1K, a benchmark for embodied AI and robotics consisting of 1,000 everyday activities derived from a human survey, grounded in 50 scenes with over 9,000 objects having rich physical and semantic annotations. These activities are implemented in the OMNIGIBSON simulator, which supports realistic physics for rigid bodies, deformable bodies, and liquids. Experiments are claimed to demonstrate that the activities are long-horizon and require complex manipulation skills that challenge state-of-the-art robot learning solutions, with an initial sim-to-real transfer study on a mobile manipulator from simulated apartment to real world.

Significance. If the quantitative validation holds and the survey accurately captures desired tasks while the simulator supports meaningful transfer, BEHAVIOR-1K could serve as a valuable standardized testbed for long-horizon planning and manipulation research in embodied AI. The human-grounded derivation and broad scene/object diversity are clear strengths that address gaps in existing benchmarks.

major comments (2)
  1. [Abstract] The central claim that 'the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions' is not supported by any visible quantitative results, success rates, error bars, or named baseline methods. This claim is load-bearing for the paper's contribution and requires explicit tables or figures in the experiments section showing the performance of current methods (including recent LLM-based or hierarchical planners) against the claimed difficulty.
  2. [Sim-to-real study] The initial calibration of the simulation-to-reality gap is described only at a high level, without the specific tasks tested, quantitative transfer metrics, adaptation protocols, or success rates. This detail is necessary to substantiate the realism claim for OMNIGIBSON and the benchmark's utility for robot learning.
minor comments (2)
  1. [§3] Add an explicit example of an activity definition (including scene, objects, and success criteria) early in the manuscript to clarify the annotation scheme.
  2. [Experiments] Ensure the experiments section includes a comparison table with recent long-horizon techniques to address potential gaps in baseline coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of BEHAVIOR-1K's contributions and for the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to strengthen the quantitative support for our claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions' is not supported by any visible quantitative results, success rates, error bars, or named baseline methods. This claim is load-bearing for the paper's contribution and requires explicit tables or figures in the experiments section showing the performance of current methods (including recent LLM-based or hierarchical planners) against the claimed difficulty.

    Authors: We agree that making the difficulty of BEHAVIOR-1K activities more explicit through quantitative baselines would strengthen the paper. The experiments section already reports task completion statistics and failure analyses across multiple activities that illustrate their long-horizon and manipulation complexity, but we acknowledge these are not presented as direct comparisons against named SOTA methods. In the revision we will add a new table (and accompanying figure) in the experiments section that reports success rates, with error bars, for representative activities using current baselines including recent LLM-based planners and hierarchical methods. This will directly support the abstract claim. revision: yes

  2. Referee: [Sim-to-real study] The initial calibration of the simulation-to-reality gap is described only at a high level, without the specific tasks tested, quantitative transfer metrics, adaptation protocols, or success rates. This detail is necessary to substantiate the realism claim for OMNIGIBSON and the benchmark's utility for robot learning.

    Authors: We agree that additional detail on the sim-to-real study is warranted. The current description is intentionally high-level as an initial calibration, but we will expand this section in the revision to specify the exact tasks evaluated (e.g., object pick-and-place sequences in the apartment scene), the quantitative transfer metrics (sim vs. real success rates), the adaptation protocols employed (including domain randomization parameters), and the achieved success rates. These additions will provide a clearer picture of the simulation-to-reality gap. revision: yes
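The kind of transfer metric promised here can be made concrete: per-task success rates in simulation and on the real robot, with a confidence interval on each and the gap reported as their difference. The trial counts below are placeholders, not results from the paper.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Placeholder numbers: 10 trials per condition for one pick-and-place task.
sim_rate = 8 / 10
real_rate = 3 / 10
gap = sim_rate - real_rate        # the sim-to-real gap for this task
lo, hi = wilson_interval(3, 10)   # uncertainty on the real-world rate
```

With only 10 trials the interval on the real-world rate is wide, which is exactly why the referee's request for explicit trial counts and error bars matters.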

Circularity Check

0 steps flagged

No circularity: benchmark defined from external survey and standard physics engine

full rationale

The paper constructs BEHAVIOR-1K by selecting 1,000 activities from an independent human survey and implementing them in the OMNIGIBSON simulator using established rigid-body, deformable, and liquid physics. No equations, fitted parameters, or self-citations are used to derive the claimed long-horizon difficulty or sim-to-real properties; these are presented as direct empirical observations from the defined benchmark rather than reductions to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the survey-derived activity set and the fidelity of the physics simulation to real-world conditions; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption The survey results on desired robot tasks accurately capture human preferences for everyday assistance.
    The 1,000 activities are defined and motivated by the results of this survey.

pith-pipeline@v0.9.0 · 5673 in / 1310 out tokens · 46531 ms · 2026-05-16T23:33:45.785896+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  4. KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

  5. asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

    cs.RO 2026-04 unverdicted novelty 7.0

    asRoBallet achieves the first hardware deployment of an end-to-end RL policy for a humanoid ballbot by training in a high-fidelity simulation that models discrete roller mechanics and multi-channel friction for zero-s...

  6. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  7. MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

    cs.AI 2026-04 unverdicted novelty 7.0

    MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.

  8. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    HeteroGenManip decouples grasp localization from interaction planning using task-conditioned foundation models and multi-model diffusion policies, delivering 31% average gains in broad simulation tasks and 36.7% in fo...

  9. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.

  10. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  11. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  12. asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper presents the first successful zero-shot Sim2Real transfer of a friction-aware RL policy for a humanoid ballbot on physical hardware.

  13. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  14. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.

  15. FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    FunRec reconstructs interactable 3D scenes with articulated parts from in-the-wild egocentric interaction videos, automatically discovering parts, estimating kinematics, and producing simulation-compatible meshes with...

  16. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  17. Physically Accurate Rigid-Body Dynamics in Particle-Based Simulation

    cs.RO 2026-03 unverdicted novelty 6.0

    PBD-R adds a momentum-conservation constraint to position-based dynamics to deliver physically accurate rigid-body dynamics while remaining computationally lighter than MuJoCo.

  18. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  19. Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot

    cs.RO 2026-01 unverdicted novelty 6.0

    Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation...

  20. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  21. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 17 Pith papers · 9 internal anchors

  1. [1]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  2. [2]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014

  3. [3]

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010

  4. [4]

    R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017

  5. [5]

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012

  6. [6]

    R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017

  7. [7]

    G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari. Actor and observer: Joint modeling of first and third-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7396–7404, 2018

  8. [8]

    PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

    Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017

  9. [9]

    R. Martín-Martín, M. Patel, H. Rezatofighi, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, and S. Savarese. Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

  10. [10]

    F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015

  11. [11]

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018

  12. [12]

    M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Using Large Corpora, page 273, 1994

  13. [13]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  14. [14]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  15. [15]

    R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013

  16. [16]

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015

  17. [17]

    D. Batra, A. X. Chang, S. Chernova, A. J. Davison, J. Deng, V. Koltun, S. Levine, J. Malik, I. Mordatch, R. Mottaghi, M. Savva, and H. Su. Rearrangement: A challenge for embodied ai. arXiv preprint arXiv:2011.01975, 2020

  18. [18]

    L. Weihs, M. Deitke, A. Kembhavi, and R. Mottaghi. Visual room rearrangement. arXiv preprint arXiv:2103.16544, 2021

  19. [19]

    C. Gan, S. Zhou, J. Schwartz, S. Alter, A. Bhandwaldar, D. Gutfreund, D. L. Yamins, J. J. DiCarlo, J. McDermott, A. Torralba, et al. The threedworld transport challenge: A visually guided task-and-motion planning benchmark for physically realistic embodied ai. arXiv preprint arXiv:2103.14025, 2021

  20. [20]

    X. Puig et al. Virtualhome: Simulating household activities via programs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

  21. [21]

    M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020

  22. [22]

    F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese. Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 5(2):713–720, 2020

  23. [23]

    T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020

  24. [24]

    S. James, Z. Ma, D. Rovick Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020

  25. [25]

    M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

  26. [26]

    A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems, volume 34, 2021

  27. [27]

    S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, K. Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022

  28. [28]

    Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

  29. [29]

    DeepMind Control Suite

    Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  30. [30]

    OpenAI Gym

    G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

  31. [31]

    M. L. Littman, I. Ajunwa, G. Berger, C. Boutilier, M. Currie, F. Doshi-Velez, G. Hadfield, M. C. Horowitz, C. Isbell, H. Kitano, et al. Gathering strength, gathering storms: The one hundred year study on artificial intelligence (AI100) 2021 study panel report. Technical report, Stanford University, 2021

  32. [32]

    M. O. Riedl. Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1):33–36, 2019

  33. [33]

    W. Xu. Toward human-centered ai: a perspective from human-computer interaction. Interactions, 26(4):42–46, 2019

  34. [34]

    B. Shneiderman. Bridging the gap between ethics and practice: guidelines for reliable, safe, and trustworthy human-centered ai systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 10(4):1–31, 2020

  35. [35]

    H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara. Robocup: A challenge problem for ai. AI magazine, 18(1):73–73, 1997

  36. [36]

    T. Wisspeintner, T. Van Der Zant, L. Iocchi, and S. Schiffer. Robocup@home: Scientific competition and benchmarking for domestic service robots. Interaction Studies, 10(3):392–426, 2009

  37. [37]

    L. Iocchi, D. Holz, J. Ruiz-del Solar, K. Sugiura, and T. Van Der Zant. Robocup@home: Analysis and results of evolving competitions for domestic and service robots. Artificial Intelligence, 229:258–281, 2015

  38. [38]

    M. Buehler, K. Iagnemma, and S. Singh. The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, volume 56. Springer, 2009

  39. [39]

    E. Krotkov, D. Hackett, L. Jackel, M. Perschbacher, J. Pippine, J. Strauss, G. Pratt, and C. Orlowski. The darpa robotics challenge finals: Results and perspectives. Journal of Field Robotics, 34(2):229–240, 2017

  40. [40]

    N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman. Analysis and observations from the first amazon picking challenge. IEEE Transactions on Automation Science and Engineering, 15(1):172–188, 2016

  41. [41]

    C. Eppner, S. Höfer, R. Jonschkowski, R. Martín-Martín, A. Sieverling, V. Wall, and O. Brock. Lessons from the amazon picking challenge: four aspects of building robotic systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4831–4835, 2017

  42. [42]

    M. A. Roa, M. Dogar, C. Vivas, A. Morales, N. Correll, M. Gorner, J. Rosell, S. Foix, R. Memmesheimer, F. Ferro, et al. Mobile manipulation hackathon: Moving into real world applications. IEEE Robotics & Automation Magazine, pages 2–14, 2021

  43. [43]

    E. Heiden, M. Macklin, Y. Narang, D. Fox, A. Garg, and F. Ramos. Disect: A differentiable simulation engine for autonomous robotic cutting. arXiv preprint arXiv:2105.12244, 2021

  44. [44]

    Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel. Doorgym: A scalable door opening environment and baseline agent. arXiv preprint arXiv:1908.01887, 2019

  45. [45]

    X. Lin, Y. Wang, J. Olkin, and D. Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. In Conference on Robot Learning, 2020

  46. [46]

    Nvidia Corp. Physx. https://developer.nvidia.com/physx-sdk, 2022. Accessed: 2022-06-10

  47. [47]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  48. [48]

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

  49. [49]

    M. Jordan and A. Perez. Optimal bidirectional rapidly-exploring random trees. Technical Report MIT-CSAIL-TR-2013-021, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 2013

  50. [50]

    Bureau of Labor Statistics

    U.S. Bureau of Labor Statistics. American Time Use Survey. https://www.bls.gov/tus/, 2019

  51. [51]

    Harmonised european time use surveys

    European Commission. Harmonised european time use surveys. https://ec.europa.eu/eurostat/web/time-use-surveys, 2010

  52. [52]

    J. Gershuny, M. Vega-Rapun, and J. Lamote. Multinational time use study. https://www.timeuse.org/mtus, 2020

  53. [53]

    wikiHow, Inc. wikihow. https://www.wikihow.com, 2021. Accessed: 2021-06-16

  54. [54]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

  55. [55]

    T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021

  56. [56]

    H. Fu, W. Xu, H. Xue, H. Yang, R. Ye, Y . Huang, Z. Xue, Y . Wang, and C. Lu. Rfuniverse: A physics-based action-centric interactive environment for everyday household tasks. arXiv preprint arXiv:2202.00199, 2022

  57. [57]

    G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39–41, 1995

  58. [58]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

  59. [59]

    C. Li, F. Xia, R. Martín-Martín, M. Lingelbach, S. Srivastava, B. Shen, K. E. Vainio, C. Gokmen, G. Dharan, T. Jain, A. Kurenkov, K. Liu, H. Gweon, J. Wu, L. Fei-Fei, and S. Savarese. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. In Annual Conference on Robot Learning, 2021

  60. [60]

    A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1):41–77, 2003

  61. [61]

    S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006

  62. [62]

    J. J. Kuffner and S. M. LaValle. Rrt-connect: An efficient approach to single-query path planning. In Proceedings IEEE International Conference on Robotics and Automation, volume 2, pages 995–1001. IEEE, 2000

  63. [63]

    K. Ehsani, W. Han, A. Herrasti, E. VanderBilt, L. Weihs, E. Kolve, A. Kembhavi, and R. Mottaghi. ManipulaTHOR: A framework for visual object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4497–4506, 2021

  64. [64]

    C. Li, F. Xia, R. Martín-Martín, and S. Savarese. Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators. In Conference on Robot Learning, pages 603–616. PMLR, 2020

  65. [65]

    V. Alipov, R. Simmons-Edler, N. Putintsev, P. Kalinin, and D. Vetrov. Towards practical credit assignment for deep reinforcement learning. arXiv preprint arXiv:2106.04499, 2021

  66. [66]

    T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, and P. Liu. Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668, 2021

  67. [67]

    I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019

  68. [68]

    S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 6(2):107–116, 1998

  69. [69]

    F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese. ReLMoGen: Leveraging motion generation in reinforcement learning for mobile manipulation. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020

  70. [70]

    J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

  71. [71]

    M. Bjelonic. YOLO ROS: Real-time object detection for ROS. https://github.com/leggedrobotics/darknet_ros, 2016–2018

  72. [72]

    S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005

  73. [73]

    L. Fan, G. Wang, D.-A. Huang, Z. Yu, L. Fei-Fei, Y. Zhu, and A. Anandkumar. SECANT: Self-expert cloning for zero-shot generalization of visual policies. arXiv preprint arXiv:2106.09678, 2021

  74. [74]

    M. Koupaee and W. Y. Wang. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018

  75. [75]

    R. Likert. A technique for the measurement of attitudes. Archives of psychology, 1932

  76. [76]

    D. C. Montgomery. Design and analysis of experiments. John Wiley & Sons, Inc., Hoboken, NJ, eighth edition, 2013

  77. [77]

    G. Paolacci, J. Chandler, and P. G. Ipeirotis. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411–419, 2010

  78. [78]

    Pew Research Center. Research in the crowdsourcing age, a case study. Technical report, Washington, D.C., July 2016. URL https://www.pewresearch.org/internet/2016/07/11/research-in-the-crowdsourcing-age-a-case-study/

  79. [79]

    A. Akbik, D. Blythe, and R. Vollgraf. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649, 2018

  80. [80]

    M. Fox and D. Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 20:61–124, Dec 2003. doi:10.1613/jair.1129. URL https://doi.org/10.1613/jair.1129

Showing first 80 references.