pith. machine review for the scientific record.

arxiv: 2403.09227 · v1 · submitted 2024-03-14 · 💻 cs.RO · cs.AI

Recognition: 1 theorem link

BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied AI · robot learning · simulation benchmark · human-centered robotics · everyday activities · manipulation skills · long-horizon tasks

The pith

The BEHAVIOR-1K benchmark defines 1,000 survey-grounded everyday activities in realistic physics simulation to test embodied AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEHAVIOR-1K as a benchmark built from an extensive survey asking people what tasks they want robots to perform for them. It specifies 1,000 activities across 50 scenes containing more than 9,000 objects, all supported by the OMNIGIBSON simulator that models rigid bodies, deformable materials, and liquids with realistic physics and rendering. Experiments demonstrate that these tasks require long sequences of actions and sophisticated manipulation skills that exceed the capabilities of current state-of-the-art robot learning methods. The work also reports an early calibration study transferring a mobile manipulator policy from the simulated apartment to its physical counterpart.

Core claim

The central claim is that BEHAVIOR-1K supplies a human-grounded collection of 1,000 everyday activities together with the OMNIGIBSON environment that renders the necessary physical interactions, and that these activities expose the inability of existing robot learning approaches to handle extended horizons and complex manipulations.

What carries the argument

The BEHAVIOR-1K benchmark pairs survey-derived activity specifications with the OMNIGIBSON simulator that enables realistic rigid-body, deformable-body, and liquid interactions across diverse indoor and outdoor scenes.
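A minimal sketch of what one of these activity specifications amounts to, assuming a pairing of scene, objects, and symbolic initial/goal conditions in the spirit of the benchmark's definitions; the field names and predicates below are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of an activity specification: a scene, the objects
# involved, and symbolic initial/goal conditions. Field names and predicate
# forms are illustrative, not BEHAVIOR-1K's real definition language.
@dataclass
class ActivitySpec:
    name: str
    scene: str            # one of the benchmark's 50 scenes, e.g. a house
    objects: list[str]    # object categories drawn from the object library
    initial: list[tuple]  # predicates that must hold at reset
    goal: list[tuple]     # predicates that define success

    def is_satisfied(self, state: set[tuple]) -> bool:
        """Success check: every goal predicate holds in the current state."""
        return all(pred in state for pred in self.goal)

clean_table = ActivitySpec(
    name="clean_the_table",
    scene="house_01",
    objects=["table", "sponge", "plate"],
    initial=[("dusty", "table"), ("on_top", "plate", "table")],
    goal=[("not_dusty", "table"), ("in", "plate", "cabinet")],
)
```

The point of the sketch is that success is a symbolic condition over simulator state, which is what lets 1,000 heterogeneous activities share one evaluation harness.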

If this is right

  • State-of-the-art robot learning methods cannot yet complete the long-horizon activities that require complex manipulation.
  • An initial study transfers a mobile-manipulator policy from simulation to a real apartment and measures the resulting performance gap.
  • The benchmark supplies a diverse, human-centered testbed intended to drive progress in embodied AI and robot learning research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithms that incorporate explicit long-horizon planning or hierarchical decomposition may be needed to make progress on these tasks.
  • Prioritizing development around the most frequently requested activity categories from the survey could focus research effort on high-impact capabilities.
  • Reducing the documented sim-to-real gap through improved physics modeling would allow more reliable deployment of policies trained in the benchmark.
  • Adding controlled variations in object properties or scene layouts would provide a direct test of generalization beyond the fixed 50 scenes.
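The hierarchical-decomposition suggestion can be pictured as splitting a long-horizon goal into ordered subgoals that a low-level policy attacks one at a time; the decomposition rule below is a toy assumption for illustration, not a method from the paper.

```python
# Toy sketch of hierarchical decomposition for a long-horizon activity:
# each goal predicate becomes a navigate/manipulate subgoal pair that a
# low-level policy could be trained on separately. Purely illustrative.
def decompose(goal_predicates):
    subgoals = []
    for pred in goal_predicates:
        target = pred[1]                    # object the predicate mentions
        subgoals.append(("navigate_to", target))
        subgoals.append(("achieve", pred))  # manipulation subgoal
    return subgoals

plan = decompose([("not_dusty", "table"), ("in", "plate", "cabinet")])
# yields four ordered subgoals, two per goal predicate
```

Even this crude split shows why horizon length compounds difficulty: every additional goal predicate multiplies the number of low-level skills that must all succeed in sequence.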

Load-bearing premise

The 1,000 activities drawn from the human survey accurately reflect tasks people want robots to perform and the OMNIGIBSON simulation is realistic enough to support meaningful transfer to physical robots.

What would settle it

A new survey of comparable size showing substantially different priority activities, or a learned policy achieving high success rates in simulation but failing to transfer to the matching real apartment, would undermine the benchmark's premises.

read the original abstract

We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BEHAVIOR-1K, a benchmark for embodied AI and robotics consisting of 1,000 everyday activities derived from a human survey, grounded in 50 scenes with over 9,000 objects having rich physical and semantic annotations. These activities are implemented in the OMNIGIBSON simulator, which supports realistic physics for rigid bodies, deformable bodies, and liquids. Experiments are claimed to demonstrate that the activities are long-horizon and require complex manipulation skills that challenge state-of-the-art robot learning solutions, with an initial sim-to-real transfer study on a mobile manipulator from simulated apartment to real world.

Significance. If the quantitative validation holds and the survey accurately captures desired tasks while the simulator supports meaningful transfer, BEHAVIOR-1K could serve as a valuable standardized testbed for long-horizon planning and manipulation research in embodied AI. The human-grounded derivation and broad scene/object diversity are clear strengths that address gaps in existing benchmarks.

major comments (2)
  1. [Abstract] The central claim that 'the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions' is not supported by any visible quantitative results, success rates, error bars, or named baseline methods. This claim is load-bearing for the paper's contribution and requires explicit tables or figures in the experiments section showing the performance of current methods (including recent LLM-based or hierarchical planners) against the claimed difficulty.
  2. [Sim-to-real study] The initial calibration of the simulation-to-reality gap is described only at a high level, without the specific tasks tested, quantitative transfer metrics, adaptation protocols, or success rates. This detail is necessary to substantiate the realism claim for OMNIGIBSON and the benchmark's utility for robot learning.
minor comments (2)
  1. [§3] Add an explicit example of an activity definition (including scene, objects, and success criteria) early in the manuscript to clarify the annotation scheme.
  2. [Experiments] Ensure the experiments section includes a comparison table with recent long-horizon techniques to address potential gaps in baseline coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of BEHAVIOR-1K's contributions and for the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to strengthen the quantitative support for our claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions' is not supported by any visible quantitative results, success rates, error bars, or named baseline methods. This claim is load-bearing for the paper's contribution and requires explicit tables or figures in the experiments section showing the performance of current methods (including recent LLM-based or hierarchical planners) against the claimed difficulty.

    Authors: We agree that making the difficulty of BEHAVIOR-1K activities more explicit through quantitative baselines would strengthen the paper. The experiments section already reports task completion statistics and failure analyses across multiple activities that illustrate their long-horizon and manipulation complexity, but we acknowledge these are not presented as direct comparisons against named SOTA methods. In the revision we will add a new table (and accompanying figure) in the experiments section that reports success rates, with error bars, for representative activities using current baselines including recent LLM-based planners and hierarchical methods. This will directly support the abstract claim. revision: yes

  2. Referee: [Sim-to-real study] The initial calibration of the simulation-to-reality gap is described only at a high level, without the specific tasks tested, quantitative transfer metrics, adaptation protocols, or success rates. This detail is necessary to substantiate the realism claim for OMNIGIBSON and the benchmark's utility for robot learning.

    Authors: We agree that additional detail on the sim-to-real study is warranted. The current description is intentionally high-level as an initial calibration, but we will expand this section in the revision to specify the exact tasks evaluated (e.g., object pick-and-place sequences in the apartment scene), the quantitative transfer metrics (sim vs. real success rates), the adaptation protocols employed (including domain randomization parameters), and the achieved success rates. These additions will provide a clearer picture of the simulation-to-reality gap. revision: yes
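The kind of transfer metric promised here can be made concrete: per-task success rates in simulation and on the real robot, with a confidence interval on each and the gap reported as their difference. The trial counts below are placeholders, not results from the paper.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Placeholder numbers: 10 trials per condition for one pick-and-place task.
sim_rate = 8 / 10
real_rate = 3 / 10
gap = sim_rate - real_rate        # the sim-to-real gap for this task
lo, hi = wilson_interval(3, 10)   # uncertainty on the real-world rate
```

With only 10 trials the interval on the real-world rate is wide, which is exactly why the referee's request for explicit trial counts and error bars matters.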

Circularity Check

0 steps flagged

No circularity: benchmark defined from external survey and standard physics engine

full rationale

The paper constructs BEHAVIOR-1K by selecting 1,000 activities from an independent human survey and implementing them in the OMNIGIBSON simulator using established rigid-body, deformable, and liquid physics. No equations, fitted parameters, or self-citations are used to derive the claimed long-horizon difficulty or sim-to-real properties; these are presented as direct empirical observations from the defined benchmark rather than reductions to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the survey-derived activity set and the fidelity of the physics simulation to real-world conditions; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption The survey results on desired robot tasks accurately capture human preferences for everyday assistance.
    The 1,000 activities are defined and motivated by the results of this survey.

pith-pipeline@v0.9.0 · 5673 in / 1310 out tokens · 46531 ms · 2026-05-16T23:33:45.785896+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  4. KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

  5. asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

    cs.RO 2026-04 unverdicted novelty 7.0

    asRoBallet achieves the first hardware deployment of an end-to-end RL policy for a humanoid ballbot by training in a high-fidelity simulation that models discrete roller mechanics and multi-channel friction for zero-s...

  6. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  7. MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

    cs.AI 2026-04 unverdicted novelty 7.0

    MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.

  8. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    HeteroGenManip decouples grasp localization from interaction planning using task-conditioned foundation models and multi-model diffusion policies, delivering 31% average gains in broad simulation tasks and 36.7% in fo...

  9. HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

    cs.RO 2026-05 unverdicted novelty 6.0

    A task-conditioned two-stage system decouples grasp localization from interaction trajectory planning using specialized foundation models to improve generalization across heterogeneous object types.

  10. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  11. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  12. asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper presents the first successful zero-shot Sim2Real transfer of a friction-aware RL policy for a humanoid ballbot on physical hardware.

  13. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  14. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.

  15. FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    FunRec reconstructs interactable 3D scenes with articulated parts from in-the-wild egocentric interaction videos, automatically discovering parts, estimating kinematics, and producing simulation-compatible meshes with...

  16. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  17. Physically Accurate Rigid-Body Dynamics in Particle-Based Simulation

    cs.RO 2026-03 unverdicted novelty 6.0

    PBD-R adds a momentum-conservation constraint to position-based dynamics to deliver physically accurate rigid-body dynamics while remaining computationally lighter than MuJoCo.

  18. RISE: Self-Improving Robot Policy with Compositional World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.

  19. Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot

    cs.RO 2026-01 unverdicted novelty 6.0

    Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation...

  20. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  21. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 17 Pith papers · 9 internal anchors

  1. [1]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  2. [2]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014

  3. [3]

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010

  4. [4]

    R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017

  5. [5]

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012

  6. [6]

    R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017

  7. [7]

    G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari. Actor and observer: Joint modeling of first and third-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7396–7404, 2018

  8. [8]

    PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

    Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017

  9. [9]

    R. Martín-Martín, M. Patel, H. Rezatofighi, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, and S. Savarese. Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

  10. [10]

    F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015

  11. [11]

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617, 2018

  12. [12]

    M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank. Using Large Corpora, page 273, 1994

  13. [13]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  14. [14]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  15. [15]

    R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013

  16. [16]

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015

  17. [17]

    D. Batra, A. X. Chang, S. Chernova, A. J. Davison, J. Deng, V. Koltun, S. Levine, J. Malik, I. Mordatch, R. Mottaghi, M. Savva, and H. Su. Rearrangement: A challenge for embodied ai. arXiv preprint arXiv:2011.01975, 2020

  18. [18]

    L. Weihs, M. Deitke, A. Kembhavi, and R. Mottaghi. Visual room rearrangement. arXiv preprint arXiv:2103.16544, 2021

  19. [19]

    C. Gan, S. Zhou, J. Schwartz, S. Alter, A. Bhandwaldar, D. Gutfreund, D. L. Yamins, J. J. DiCarlo, J. McDermott, A. Torralba, et al. The threedworld transport challenge: A visually guided task-and-motion planning benchmark for physically realistic embodied ai. arXiv preprint arXiv:2103.14025, 2021

  20. [20]

    X. Puig et al. Virtualhome: Simulating household activities via programs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

  21. [21]

    M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 10740–10749, 2020

  22. [22]

    F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese. Interactive gibson benchmark: A benchmark for interactive navigation in cluttered environments. IEEE Robotics and Automation Letters, 5(2):713–720, 2020

  23. [23]

    T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020

  24. [24]

    S. James, Z. Ma, D. Rovick Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020

  25. [25]

    M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019

  26. [26]

    A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems, volume 34, 2021

  27. [27]

    S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, K. Liu, et al. Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pages 477–490. PMLR, 2022

  28. [28]

    Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020

  29. [29]

    DeepMind Control Suite

    Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

  30. [30]

    OpenAI Gym

    G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

  31. [31]

    M. L. Littman, I. Ajunwa, G. Berger, C. Boutilier, M. Currie, F. Doshi-Velez, G. Hadfield, M. C. Horowitz, C. Isbell, H. Kitano, et al. Gathering strength, gathering storms: The one hundred year study on artificial intelligence (AI100) 2021 study panel report. Technical report, Stanford University, 2021

  32. [32]

    M. O. Riedl. Human-centered artificial intelligence and machine learning. Human Behavior and Emerging Technologies, 1(1):33–36, 2019

  33. [33]

    W. Xu. Toward human-centered ai: a perspective from human-computer interaction. Interactions, 26(4):42–46, 2019

  34. [34]

    B. Shneiderman. Bridging the gap between ethics and practice: guidelines for reliable, safe, and trustworthy human-centered ai systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 10(4):1–31, 2020

  35. [35]

    H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara. Robocup: A challenge problem for ai. AI magazine, 18(1):73–73, 1997

  36. [36]

    T. Wisspeintner, T. Van Der Zant, L. Iocchi, and S. Schiffer. Robocup@home: Scientific competition and benchmarking for domestic service robots. Interaction Studies, 10(3):392–426, 2009

  37. [37]

    L. Iocchi, D. Holz, J. Ruiz-del Solar, K. Sugiura, and T. Van Der Zant. Robocup@home: Analysis and results of evolving competitions for domestic and service robots. Artificial Intelligence, 229:258–281, 2015

  38. [38]

    M. Buehler, K. Iagnemma, and S. Singh. The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, volume 56. Springer, 2009

  39. [39]

    E. Krotkov, D. Hackett, L. Jackel, M. Perschbacher, J. Pippine, J. Strauss, G. Pratt, and C. Orlowski. The darpa robotics challenge finals: Results and perspectives. Journal of Field Robotics, 34(2):229–240, 2017

  40. [40]

    N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman. Analysis and observations from the first amazon picking challenge. IEEE Transactions on Automation Science and Engineering, 15(1):172–188, 2016

  41. [41]

    C. Eppner, S. Höfer, R. Jonschkowski, R. Martín-Martín, A. Sieverling, V. Wall, and O. Brock. Lessons from the amazon picking challenge: four aspects of building robotic systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4831–4835, 2017

  42. [42]

    M. A. Roa, M. Dogar, C. Vivas, A. Morales, N. Correll, M. Gorner, J. Rosell, S. Foix, R. Memmesheimer, F. Ferro, et al. Mobile manipulation hackathon: Moving into real world applications. IEEE Robotics & Automation Magazine, pages 2–14, 2021

  43. [43]

    E. Heiden, M. Macklin, Y. Narang, D. Fox, A. Garg, and F. Ramos. Disect: A differentiable simulation engine for autonomous robotic cutting. arXiv preprint arXiv:2105.12244, 2021

  44. [44]

    Y. Urakami, A. Hodgkinson, C. Carlin, R. Leu, L. Rigazio, and P. Abbeel. Doorgym: A scalable door opening environment and baseline agent. arXiv preprint arXiv:1908.01887, 2019

  45. [45]

    X. Lin, Y. Wang, J. Olkin, and D. Held. Softgym: Benchmarking deep reinforcement learning for deformable object manipulation. In Conference on Robot Learning, 2020

  46. [46]

    Nvidia Corp. Physx. https://developer.nvidia.com/physx-sdk, 2022. Accessed: 2022-06-10

  47. [47]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  48. [48]

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

  49. [49]

    M. Jordan and A. Perez. Optimal bidirectional rapidly-exploring random trees. Technical Report MIT-CSAIL-TR-2013-021, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 2013

  50. [50]

    Bureau of Labor Statistics

    U.S. Bureau of Labor Statistics. American Time Use Survey. https://www.bls.gov/tus/, 2019

  51. [51]

    Harmonised european time use surveys

    European Commission. Harmonised european time use surveys. https://ec.europa.eu/eurostat/web/time-use-surveys, 2010

  52. [52]

    J. Gershuny, M. Vega-Rapun, and J. Lamote. Multinational time use study. https://www.timeuse.org/mtus, 2020

  53. [53]

    wikiHow, Inc. wikihow. https://www.wikihow.com, 2021. Accessed: 2021-06-16

  54. [54]

    Xiang, Y

    F. Xiang, Y . Qin, K. Mo, Y . Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y . Yuan, H. Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020

  55. [55]

    T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. arXiv preprint arXiv:2107.14483, 2021

  56. [56]

    H. Fu, W. Xu, H. Xue, H. Yang, R. Ye, Y . Huang, Z. Xue, Y . Wang, and C. Lu. Rfuniverse: A physics-based action-centric interactive environment for everyday household tasks. arXiv preprint arXiv:2202.00199, 2022

  57. [57]

    G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39–41, 1995

  58. [58]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

  59. [59]

    C. Li, F. Xia, R. Martín-Martín, M. Lingelbach, S. Srivastava, B. Shen, K. E. Vainio, C. Gokmen, G. Dharan, T. Jain, A. Kurenkov, K. Liu, H. Gweon, J. Wu, L. Fei-Fei, and S. Savarese. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. In Annual Conference on Robot Learning, 2021

  60. [60]

    A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1):41–77, 2003

  61. [61]

    S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006

  62. [62]

    J. J. Kuffner and S. M. LaValle. Rrt-connect: An efficient approach to single-query path planning. In Proceedings IEEE International Conference on Robotics and Automation, volume 2, pages 995–1001. IEEE, 2000

  63. [63]

    K. Ehsani, W. Han, A. Herrasti, E. VanderBilt, L. Weihs, E. Kolve, A. Kembhavi, and R. Mottaghi. ManipulaTHOR: A framework for visual object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4497–4506, 2021

  64. [64]

    C. Li, F. Xia, R. Martín-Martín, and S. Savarese. Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators. In Conference on Robot Learning, pages 603–616. PMLR, 2020

  65. [65]

    V. Alipov, R. Simmons-Edler, N. Putintsev, P. Kalinin, and D. Vetrov. Towards practical credit assignment for deep reinforcement learning. arXiv preprint arXiv:2106.04499, 2021

  66. [66]

    T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, and P. Liu. Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668, 2021

  67. [67]

    I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019

  68. [68]

    S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 6(2):107–116, 1998

  69. [69]

    F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese. ReLMoGen: Leveraging motion generation in reinforcement learning for mobile manipulation. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020

  70. [70]

    J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018

  71. [71]

    M. Bjelonic. YOLO ROS: Real-time object detection for ROS. https://github.com/leggedrobotics/darknet_ros, 2016–2018

  72. [72]

    S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005

  73. [73]

    L. Fan, G. Wang, D.-A. Huang, Z. Yu, L. Fei-Fei, Y. Zhu, and A. Anandkumar. SECANT: Self-expert cloning for zero-shot generalization of visual policies. arXiv preprint arXiv:2106.09678, 2021

  74. [74]

    M. Koupaee and W. Y. Wang. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018

  75. [75]

    R. Likert. A technique for the measurement of attitudes. Archives of psychology, 1932

  76. [76]

    D. C. Montgomery. Design and analysis of experiments. John Wiley & Sons, Inc., Hoboken, NJ, eighth edition, 2013

  77. [77]

    G. Paolacci, J. Chandler, and P. G. Ipeirotis. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411–419, 2010

  78. [78]

    Pew Research Center. Research in the crowdsourcing age, a case study. Technical report, Washington, D.C., July 2016. URL https://www.pewresearch.org/internet/2016/07/11/research-in-the-crowdsourcing-age-a-case-study/

  79. [79]

    A. Akbik, D. Blythe, and R. Vollgraf. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649, 2018

  80. [80]

    M. Fox and D. Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 20:61–124, Dec 2003. doi:10.1613/jair.1129. URL https://doi.org/10.1613/jair.1129

Showing first 80 references.