When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
Pith reviewed 2026-05-20 21:21 UTC · model grok-4.3
The pith
The LongAct benchmark shows that even top models fully complete only 16% of long-horizon household tasks, even with a new planning agent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongAct is established as a benchmark that evaluates planning-level autonomy on extended household tasks described in natural language, by abstracting away embodiment-specific control to isolate capabilities including instruction understanding, dependency management, memory maintenance, and adaptive planning. HoloMind is introduced as a VLM-driven agent that combines a DAG-based long-horizon hierarchical planner, multimodal spatial memory for persistent world modeling, episodic memory for experience reuse, and a global critic for reflective supervision. Experiments show HoloMind substantially improves long-horizon performance while lowering dependence on model scale, although top models are still limited to 59% goal completion and 16% full-task success.
What carries the argument
HoloMind, a VLM-driven agent that integrates a DAG-based hierarchical planner, multimodal spatial memory, episodic memory, and a global critic to support sustained reasoning in long-horizon tasks.
If this is right
- Hierarchical DAG planning enables better breakdown and ordering of dependent subtasks in extended household sequences.
- Multimodal spatial and episodic memory modules support consistent world modeling and reuse of prior experience across steps.
- A global critic provides reflective supervision that improves adaptation when plans encounter unexpected changes.
- Performance gains from HoloMind hold across different underlying VLMs, showing architecture matters more than raw model size alone.
- The benchmark's low ceiling of 16% full success highlights the need for further advances in long-horizon reasoning for embodied agents.
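The idea of ordering dependent subtasks with a DAG can be sketched with a topological sort. This is a minimal illustration, not HoloMind's planner: the subtask names and the `plan_order` and `is_valid_order` helpers are hypothetical, and the only library used is Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks for "make tea and serve it" (not taken from LongAct).
# Each key maps to the set of subtasks that must finish before it can start.
dependencies = {
    "fill_kettle": set(),
    "boil_water": {"fill_kettle"},
    "fetch_cup": set(),
    "add_teabag": {"fetch_cup"},
    "pour_water": {"boil_water", "add_teabag"},
    "serve_tea": {"pour_water"},
}


def plan_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_order(order, deps):
    """Check that every subtask appears after all of its prerequisites."""
    position = {task: i for i, task in enumerate(order)}
    return all(
        position[prereq] < position[task]
        for task, prereqs in deps.items()
        for prereq in prereqs
    )


order = plan_order(dependencies)
print(order)
print(is_valid_order(order, dependencies))
```

Several orders can be valid for the same graph (the kettle and the cup are independent), which is why the check tests the dependency constraint instead of comparing against one fixed sequence.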
Where Pith is reading between the lines
- The same memory-and-planner structure could transfer to other extended activities such as multi-step assembly or sequential caregiving.
- Connecting LongAct to physical robot platforms would test whether high-level plans survive real sensor noise and actuation limits.
- Failure patterns on the benchmark may identify specific instruction ambiguities that targeted training data could address.
Load-bearing premise
Abstracting away embodiment-specific low-level control isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning.
What would settle it
Demonstrating that a baseline VLM without the DAG planner, spatial memory, episodic memory, or critic achieves 16% or higher full-task success on LongAct would indicate the specialized components are not required for the observed gains.
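The comparison described above reduces to simple arithmetic over per-trial outcomes. The sketch below assumes each trial is recorded as a boolean for full-task success; the `success_rate` helper and the trial data are made up for illustration and are not results from the paper. Only the 16% threshold comes from the abstract.

```python
def success_rate(outcomes):
    """Fraction of trials marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


# Full-task success that the abstract reports for the top models.
REPORTED_BEST = 0.16

# Made-up outcomes for a hypothetical baseline without the extra modules:
# 2 successes out of 20 trials.
baseline_trials = [True] * 2 + [False] * 18

rate = success_rate(baseline_trials)
reaches_threshold = rate >= REPORTED_BEST
print(f"baseline full-task success: {rate:.2f}, reaches 16%: {reaches_threshold}")
```

With only a handful of trials the estimate is noisy, so a real comparison would also need enough trials per task for the difference to be meaningful.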
Original abstract
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongAct, a benchmark for long-horizon household tasks specified via free-form instructions. By abstracting away embodiment-specific low-level control, LongAct is claimed to isolate high-level capabilities including instruction understanding, dependency management, memory maintenance, and adaptive planning. The authors also propose HoloMind, a VLM-driven agent using a DAG-based long-horizon hierarchical planner, Multimodal Spatial Memory, Episodic Memory, and a global Critic. Experiments with GPT-5 and Qwen3-VL show HoloMind improves performance over baselines, yet even top models reach only 59% goal completion and 16% full-task success, underscoring the benchmark's difficulty.
Significance. If the benchmark definitions and results prove reproducible, the work could usefully redirect embodied AI research toward sustained long-horizon planning rather than short-horizon navigation or manipulation. The concrete agent architecture (DAG planner plus dual memories and critic) supplies testable components that future systems can adopt or ablate. The reported performance gap supplies a clear, falsifiable target for progress in memory and adaptive reasoning.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The reported figures (59% goal completion, 16% full-task success) are presented without accompanying details on task definitions, success criteria, number of trials per task, or controls for prompting variations. This absence prevents assessment of whether the numbers genuinely demonstrate LongAct's difficulty or whether design choices in task specification or evaluation protocol inflate or deflate the measured gap.
- [Benchmark design] Benchmark design (likely §3): The central claim that abstracting low-level control cleanly isolates high-level cognition rests on an unverified separation. The manuscript does not specify the observation model (ground-truth poses versus partial observability), the precise action interface, or how simulator state encodes execution feasibility. Without these details the 59%/16% figures could partly reflect residual low-level reasoning demands rather than deficits in planning or memory alone.
minor comments (2)
- [Agent architecture] Clarify whether the DAG planner is constructed from the instruction or learned; the current description leaves the construction process ambiguous.
- [Benchmark] Add a table summarizing task categories, average horizon length, and success criteria to make the benchmark concrete for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional clarifications and details as requested.
- Referee: [Abstract and Experiments] Abstract and Experiments section: The reported figures (59% goal completion, 16% full-task success) are presented without accompanying details on task definitions, success criteria, number of trials per task, or controls for prompting variations. This absence prevents assessment of whether the numbers genuinely demonstrate LongAct's difficulty or whether design choices in task specification or evaluation protocol inflate or deflate the measured gap.
  Authors: We agree that greater specificity on these elements improves reproducibility and allows readers to better evaluate the results. In the revised manuscript we have expanded the Experiments section with a new subsection that explicitly defines each task, states the success criteria (goal completion requires all sub-goals to be satisfied within simulator tolerances; full-task success requires exact adherence to the intended sequence without extraneous actions), reports that each task was evaluated over 10 independent trials, and describes the prompting controls (fixed template with minor lexical variations tested for sensitivity). These additions confirm that the reported 59% goal-completion and 16% full-task-success rates are stable across trials and prompting conditions and reflect the benchmark’s intrinsic difficulty.
  Revision: yes
- Referee: [Benchmark design] Benchmark design (likely §3): The central claim that abstracting low-level control cleanly isolates high-level cognition rests on an unverified separation. The manuscript does not specify the observation model (ground-truth poses versus partial observability), the precise action interface, or how simulator state encodes execution feasibility. Without these details the 59%/16% figures could partly reflect residual low-level reasoning demands rather than deficits in planning or memory alone.
  Authors: We acknowledge that the original description of the abstraction could be more precise. Section 3 already states that LongAct operates at the planning level by providing a high-level action space, but we have now added explicit specifications: the observation model supplies ground-truth object poses and states to the planner while the agent’s VLM perception module operates under partial observability; the action interface consists of discrete high-level commands (e.g., “navigate to X”, “pick up Y”) whose low-level execution is assumed perfect by the benchmark; and the simulator encodes execution feasibility via precondition checks performed by the DAG planner before any action is issued. These clarifications have been inserted into the revised §3 and the Experiments section. We maintain that the abstraction isolates high-level capabilities, yet we agree the added detail removes any ambiguity about residual low-level demands.
  Revision: yes
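The action interface described in the second response (discrete high-level commands gated by precondition checks, with low-level execution assumed perfect) can be sketched as follows. This is an illustration under stated assumptions, not LongAct's API: the `WorldState` fields and the `navigate_to` and `pick_up` functions are hypothetical, and the checks live inside the action functions here purely for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WorldState:
    """Toy symbolic state; field names are illustrative, not from the paper."""
    robot_at: str = "hallway"
    holding: Optional[str] = None
    object_locations: Dict[str, str] = field(default_factory=dict)


def navigate_to(state, place):
    # No precondition in this toy model: navigation always succeeds,
    # mirroring the assumption of perfect low-level execution.
    state.robot_at = place
    return True


def pick_up(state, obj):
    # Preconditions: the hand is empty and the object is at the robot's location.
    if state.holding is not None:
        return False
    if state.object_locations.get(obj) != state.robot_at:
        return False
    # Effects: the robot now holds the object, which leaves its old location.
    state.holding = obj
    del state.object_locations[obj]
    return True


state = WorldState(object_locations={"mug": "kitchen"})
print(pick_up(state, "mug"))          # False: robot is still in the hallway
print(navigate_to(state, "kitchen"))  # True
print(pick_up(state, "mug"))          # True: preconditions now hold
```

Under an interface like this, a failed action signals a planning error (wrong order or missing step) and never a motor-control failure, which is the separation the benchmark is claimed to provide.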
Circularity Check
No circularity: empirical benchmark and agent proposal with independent design choices
The paper introduces LongAct as a benchmark that abstracts low-level control to isolate high-level planning capabilities and proposes HoloMind with specific modules (DAG planner, spatial/episodic memory, critic). These are explicit design decisions and empirical evaluations on GPT-5/Qwen3-VL models, not derivations, equations, or predictions that reduce to fitted parameters or self-citations by construction. No load-bearing steps equate outputs to inputs; results (59% goal completion, 16% full success) are reported outcomes rather than forced quantities. The work is self-contained and externally falsifiable via the benchmark tasks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities
- domain assumption: Abstracting away embodiment-specific low-level control isolates high-level cognitive capabilities
invented entities (4)
- DAG-based long-horizon hierarchical planner: no independent evidence
- Multimodal Spatial Memory: no independent evidence
- Episodic Memory: no independent evidence
- global Critic: no independent evidence