pith. sign in

arxiv: 2605.14504 · v2 · pith:H6KWZFSQnew · submitted 2026-05-14 · 💻 cs.AI

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Pith reviewed 2026-05-20 21:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords long-horizon planninghousehold tasksembodied AIvision-language modelsbenchmarkhierarchical plannermemory systemstask execution
0
0 comments X

The pith

LongAct benchmark shows current models complete just 16% of long household tasks despite new planning agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LongAct as a benchmark for long-horizon household tasks given through free-form instructions, deliberately setting aside low-level robot movements to test higher-level skills such as managing task dependencies and maintaining memory over time. It also presents HoloMind, an agent built around a directed acyclic graph planner, persistent spatial memory, episodic memory for reusing past experiences, and a critic that reviews progress. Experiments using models like GPT-5 and Qwen3-VL demonstrate that HoloMind raises performance, yet even the best results reach only 59% goal completion and 16% full-task success. This gap indicates that sustained reasoning across many steps remains a core limitation for embodied agents operating in realistic home settings.

Core claim

LongAct is established as a benchmark that evaluates planning-level autonomy on extended household tasks described in natural language, by abstracting away embodiment-specific control to isolate capabilities including instruction understanding, dependency management, memory maintenance, and adaptive planning. HoloMind is introduced as a VLM-driven agent that combines a DAG-based long-horizon hierarchical planner, multimodal spatial memory for persistent world modeling, episodic memory for experience reuse, and a global critic for reflective supervision. Experiments show HoloMind substantially improves long-horizon performance while lowering dependence on model scale, although top models are仍

What carries the argument

HoloMind, a VLM-driven agent that integrates a DAG-based hierarchical planner, multimodal spatial memory, episodic memory, and a global critic to support sustained reasoning in long-horizon tasks.

If this is right

  • Hierarchical DAG planning enables better breakdown and ordering of dependent subtasks in extended household sequences.
  • Multimodal spatial and episodic memory modules support consistent world modeling and reuse of prior experience across steps.
  • A global critic provides reflective supervision that improves adaptation when plans encounter unexpected changes.
  • Performance gains from HoloMind hold across different underlying VLMs, showing architecture matters more than raw model size alone.
  • The benchmark's low ceiling of 16% full success highlights the need for further advances in long-horizon reasoning for embodied agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-and-planner structure could transfer to other extended activities such as multi-step assembly or sequential caregiving.
  • Connecting LongAct to physical robot platforms would test whether high-level plans survive real sensor noise and actuation limits.
  • Failure patterns on the benchmark may identify specific instruction ambiguities that targeted training data could address.

Load-bearing premise

Abstracting away embodiment-specific low-level control isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning.

What would settle it

Demonstrating that a baseline VLM without the DAG planner, spatial memory, episodic memory, or critic achieves 16% or higher full-task success on LongAct would indicate the specialized components are not required for the observed gains.

Figures

Figures reproduced from arXiv: 2605.14504 by Bowen Pang, Jing Liu, Longteng Guo, Ruyi Ji, Xingjian He, Yanghong Mei, Zilin Zhu, Zongxun Zhang.

Figure 1
Figure 1. Figure 1: Overview of LongAct Bench. LongAct evaluates agents on long-horizon household tasks that span 500+ human steps and require tightly coupled navigation and manipulation across multi-room environments. The benchmark emphasizes persistent reasoning, memory, and error recovery over thousands of actions, revealing the challenges of achieving reliable long-horizon autonomy. To bridge this gap, we introduce LongAc… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the LongAct benchmark construction pipeline. Each episode includes a multi-room house environment, a multi-goal long-horizon task, and a final-state checklist for evaluation. execution. Second, the Critic module also proves crucial for stable execution: removing it reduces accuracy by roughly 40% and leads to a deterioration of up to 90% in manipulation efficiency and Improvement Rate, demonstr… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the HoloMind framework. HoloMind consists of four modules: a two-layer Planner, an Executor, a Memory, and a Critic. The Planner takes the task T and, with support from Memory, incrementally decomposes it into executable natural-language instructions Ia. The Executor uses navigation and manipulation skill libraries to convert Ia and observations Oi into atomic simulator actions A. Throughout ex… view at source ↗
Figure 4
Figure 4. Figure 4: A sample trajectory visualization of HoloMind on LongAct Bench. The agent decomposes the task into five goals executed through interleaved navigation and manipulation steps. When encountering an issue (highlighted in red), HoloMind analyzes the cause of failure and adjusts its strategy accordingly. structure, scaling alone provides limited gains: Qwen3-VL-32B achieves only 6.14% GC and fails entirely in SR… view at source ↗
Figure 5
Figure 5. Figure 5: Human and agent scoring trajectories Com￾parison. What Types of Errors Remain? [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Error distribution across task planning, memory, and execution. This highlights that the most embodied forms of interaction—fine-grained manipulation—remain the most challenging aspect for an LLM-based agent. Manip: Take the bread from the kitchen manip: Slice bread Manip: Take a slice Manip: Navigate to kitchen counter -Critic- "The bread is currently being held by the agent, but slicing requires the obje… view at source ↗
Figure 7
Figure 7. Figure 7: Demonstration of critic-assisted correction during execution. When the agent fails to execute Slice bread, the Critic inspects the failure, identifies that the bread is not placed on a surface, and proposes a corrective action while updating its experience [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Demonstration of sequential object handling with persistent identity tracking. HoloMind is able to pick up one pen, correctly place it, and then pick up the other, maintaining a stable understanding that the two pens are distinct objects across the entire sequence of actions. accumulation. Such “getting better over time” behavior is often more indicative of long-term utility than a single static measure of… view at source ↗
Figure 9
Figure 9. Figure 9: Demonstration of a failure case with unresolved errors. The agent incorrectly places the first book but fails to detect the mistake. When attempting to pick up the second book, the system becomes inconsistent, the Critic cannot recover, and control is returned to the planner, terminating the task. Having outlined the motivation for IR and the key intuition behind its design, we now provide a formal definit… view at source ↗
Figure 10
Figure 10. Figure 10: Multimodal evidence used for memory retrieval. The LLM receives visual candidates from CLIP and determines whether each image matches the query description, enabling disambiguation in cluttered or visually similar scenes. we partition the sequence uniformly into k contiguous intervals: Sk,j =  st | t ∈ Ik,j , j = 1, . . . , k. (D5) These intervals reflect different temporal resolutions: • Small k: coarse… view at source ↗
Figure 11
Figure 11. Figure 11: Power-law curves for IR values from 0 to 2. We numerically determine the exponent a such that st = (t/T) a produces the desired IR value, yielding a continuous sweep of geometric shapes. 0.0 0.2 0.4 0.6 0.8 1.0 Normalized step t/T 0.0 0.2 0.4 0.6 0.8 1.0 N orm aliz e d s c ore st Synthetic Power-Law Score Curves for Different Agents Human (IR=1.78) GPT-5 (IR=1.70) GPT-5-mini (IR=1.16) Qwen3-VL-32B (IR=1.6… view at source ↗
read the original abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LongAct, a benchmark for long-horizon household tasks specified via free-form instructions. By abstracting away embodiment-specific low-level control, LongAct is claimed to isolate high-level capabilities including instruction understanding, dependency management, memory maintenance, and adaptive planning. The authors also propose HoloMind, a VLM-driven agent using a DAG-based long-horizon hierarchical planner, Multimodal Spatial Memory, Episodic Memory, and a global Critic. Experiments with GPT-5 and Qwen3-VL show HoloMind improves performance over baselines, yet even top models reach only 59% goal completion and 16% full-task success, underscoring the benchmark's difficulty.

Significance. If the benchmark definitions and results prove reproducible, the work could usefully redirect embodied AI research toward sustained long-horizon planning rather than short-horizon navigation or manipulation. The concrete agent architecture (DAG planner plus dual memories and critic) supplies testable components that future systems can adopt or ablate. The reported performance gap supplies a clear, falsifiable target for progress in memory and adaptive reasoning.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The reported figures (59% goal completion, 16% full-task success) are presented without accompanying details on task definitions, success criteria, number of trials per task, or controls for prompting variations. This absence prevents assessment of whether the numbers genuinely demonstrate LongAct's difficulty or whether design choices in task specification or evaluation protocol inflate or deflate the measured gap.
  2. [Benchmark design] Benchmark design (likely §3): The central claim that abstracting low-level control cleanly isolates high-level cognition rests on an unverified separation. The manuscript does not specify the observation model (ground-truth poses versus partial observability), the precise action interface, or how simulator state encodes execution feasibility. Without these details the 59%/16% figures could partly reflect residual low-level reasoning demands rather than deficits in planning or memory alone.
minor comments (2)
  1. [Agent architecture] Clarify whether the DAG planner is constructed from the instruction or learned; the current description leaves the construction process ambiguous.
  2. [Benchmark] Add a table summarizing task categories, average horizon length, and success criteria to make the benchmark concrete for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional clarifications and details as requested.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The reported figures (59% goal completion, 16% full-task success) are presented without accompanying details on task definitions, success criteria, number of trials per task, or controls for prompting variations. This absence prevents assessment of whether the numbers genuinely demonstrate LongAct's difficulty or whether design choices in task specification or evaluation protocol inflate or deflate the measured gap.

    Authors: We agree that greater specificity on these elements improves reproducibility and allows readers to better evaluate the results. In the revised manuscript we have expanded the Experiments section with a new subsection that explicitly defines each task, states the success criteria (goal completion requires all sub-goals to be satisfied within simulator tolerances; full-task success requires exact adherence to the intended sequence without extraneous actions), reports that each task was evaluated over 10 independent trials, and describes the prompting controls (fixed template with minor lexical variations tested for sensitivity). These additions confirm that the reported 59 % goal-completion and 16 % full-task-success rates are stable across trials and prompting conditions and reflect the benchmark’s intrinsic difficulty. revision: yes

  2. Referee: [Benchmark design] Benchmark design (likely §3): The central claim that abstracting low-level control cleanly isolates high-level cognition rests on an unverified separation. The manuscript does not specify the observation model (ground-truth poses versus partial observability), the precise action interface, or how simulator state encodes execution feasibility. Without these details the 59%/16% figures could partly reflect residual low-level reasoning demands rather than deficits in planning or memory alone.

    Authors: We acknowledge that the original description of the abstraction could be more precise. Section 3 already states that LongAct operates at the planning level by providing a high-level action space, but we have now added explicit specifications: the observation model supplies ground-truth object poses and states to the planner while the agent’s VLM perception module operates under partial observability; the action interface consists of discrete high-level commands (e.g., “navigate to X”, “pick up Y”) whose low-level execution is assumed perfect by the benchmark; and the simulator encodes execution feasibility via precondition checks performed by the DAG planner before any action is issued. These clarifications have been inserted into the revised §3 and the Experiments section. We maintain that the abstraction isolates high-level capabilities, yet we agree the added detail removes any ambiguity about residual low-level demands. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and agent proposal with independent design choices

full rationale

The paper introduces LongAct as a benchmark that abstracts low-level control to isolate high-level planning capabilities and proposes HoloMind with specific modules (DAG planner, spatial/episodic memory, critic). These are explicit design decisions and empirical evaluations on GPT-5/Qwen3-VL models, not derivations, equations, or predictions that reduce to fitted parameters or self-citations by construction. No load-bearing steps equate outputs to inputs; results (59% goal completion, 16% full success) are reported outcomes rather than forced quantities. The work is self-contained and externally falsifiable via the benchmark tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

Central claims rest on the design choice to isolate high-level planning by removing low-level control and on the effectiveness of the newly introduced agent components; both are introduced without external validation or prior independent evidence in the abstract.

axioms (2)
  • domain assumption Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities
    Stated directly as the motivation for creating LongAct in the abstract.
  • domain assumption Abstracting away embodiment-specific low-level control isolates high-level cognitive capabilities
    Core methodological premise for the benchmark design stated in the abstract.
invented entities (4)
  • DAG-based long-horizon hierarchical planner no independent evidence
    purpose: To structure planning for extended task sequences
    New component introduced as part of HoloMind.
  • Multimodal Spatial Memory no independent evidence
    purpose: Persistent world modeling
    New memory module introduced in HoloMind.
  • Episodic Memory no independent evidence
    purpose: Experience reuse
    New memory module introduced in HoloMind.
  • global Critic no independent evidence
    purpose: Reflective supervision
    New supervision module introduced in HoloMind.

pith-pipeline@v0.9.0 · 5742 in / 1600 out tokens · 55625 ms · 2026-05-20T21:21:04.353535+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 12 internal anchors

  1. [1]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 36:1–19, 2023

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023

  5. [5]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  6. [6]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025. 10

  7. [7]

    Reverie: Remote embodied visual referring expression in real indoor environments

    Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020

  8. [8]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018

  9. [9]

    Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation

    Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, and Sehoon Ha. Hm3d-ovon: A dataset and benchmark for open-vocabulary object goal navigation. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5543–5550. IEEE, 2024

  10. [10]

    Goat-bench: A benchmark for multi-modal lifelong navigation

    Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi-modal lifelong navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16373–16383, 2024

  11. [11]

    Soundspaces: Audio-visual navigation in 3d environments

    Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. InEuropean conference on computer vision, pages 17–36. Springer, 2020

  12. [12]

    Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

    Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

  13. [13]

    C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

    Ming-Ming Yu, Fei Zhu, Wenzhuo Liu, Yirong Yang, Qunbo Wang, Wenjun Wu, and Jing Liu. C-nav: Towards self-evolving continual object navigation in open world.arXiv preprint arXiv:2510.20685, 2025

  14. [14]

    Robotwin: Dual-arm robot benchmark with generative digital twins (early version)

    Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). InEuropean Conference on Computer Vision, pages 264–273. Springer, 2024

  15. [15]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettle- moyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020

  16. [16]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  17. [17]

    Rearrangement: A challenge for embodied AI,

    Dhruv Batra, Angel X Chang, Sonia Chernova, Andrew J Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, et al. Rearrangement: A challenge for embodied ai.arXiv preprint arXiv:2011.01975, 2020

  18. [18]

    The threedworld transport challenge: A visually guided task-and-motion planning benchmark towards physically realistic embodied ai

    Chuang Gan, Siyuan Zhou, Jeremy Schwartz, Seth Alter, Abhishek Bhandwaldar, Dan Gutfreund, Daniel LK Yamins, James J DiCarlo, Josh McDermott, Antonio Torralba, et al. The threedworld transport challenge: A visually guided task-and-motion planning benchmark towards physically realistic embodied ai. In2022 International conference on robotics and automation...

  19. [19]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  20. [20]

    Tutorial on directed acyclic graphs.Journal of Clinical Epidemiology, 142:264–267, 2022

    Jean C Digitale, Jeffrey N Martin, and Medellena Maria Glymour. Tutorial on directed acyclic graphs.Journal of Clinical Epidemiology, 142:264–267, 2022

  21. [21]

    Openai gpt-5 system card, 2025

  22. [22]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

  23. [23]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021

  24. [24]

    arXiv preprint arXiv:2010.09890 , year=

    Xavier Puig, Tianmin Shu, Shuang Li, Zilin Wang, Yuan-Hong Liao, Joshua B Tenenbaum, Sanja Fidler, and Antonio Torralba. Watch-and-help: A challenge for social perception and human-ai collaboration.arXiv preprint arXiv:2010.09890, 2020. 11

  25. [25]

    Lota-bench: Bench- marking language-oriented task planners for embodied agents,

    Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. Lota-bench: Benchmarking language-oriented task planners for embodied agents.arXiv preprint arXiv:2402.08178, 2024

  26. [26]

    Goat-bench: A benchmark for multi-modal lifelong navigation, 2024

    Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. Goat-bench: A benchmark for multi-modal lifelong navigation, 2024

  27. [27]

    Karen Liu, Jiajun Wu, and Li Fei-Fei

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R ...

  28. [28]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  29. [29]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  30. [30]

    Llm-state: Expandable state representation for long-horizon task planning in the open world.CoRR, 2023

    Siwei Chen, Anxing Xiao, and David Hsu. Llm-state: Expandable state representation for long-horizon task planning in the open world.CoRR, 2023

  31. [31]

    SayPlan: Grounding large language models using 3d scene graphs for scalable robot task planning,

    Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning.arXiv preprint arXiv:2307.06135, 2023

  32. [32]

    ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302, 2022

  33. [33]

    Film: Following instructions in language with modular methods.arXiv preprint arXiv:2110.07342, 2021

    So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. Film: Following instructions in language with modular methods.arXiv preprint arXiv:2110.07342, 2021

  34. [34]

    Epo: Hierarchical llm agents with environment preference optimization.arXiv preprint arXiv:2408.16090, 2024

    Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. Epo: Hierarchical llm agents with environment preference optimization.arXiv preprint arXiv:2408.16090, 2024

  35. [35]

    Robogpt: an llm-based long-term decision-making embodied agent for instruction following tasks

    Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Jinrui Liu, Haoran Li, Dongbin Zhao, and He Wang. Robogpt: an llm-based long-term decision-making embodied agent for instruction following tasks. IEEE Transactions on Cognitive and Developmental Systems, 2025

  36. [36]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

  37. [37]

    Context-aware planning and environment-aware memory for instruction following embodied agents

    Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, and Jonghyun Choi. Context-aware planning and environment-aware memory for instruction following embodied agents. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10936–10946, 2023

  38. [38]

    Flowplan: Zero-shot task planning with llm flow engineering for robotic instruction following.arXiv preprint arXiv:2503.02698, 2025

    Zijun Lin, Chao Tang, Hanjing Ye, and Hong Zhang. Flowplan: Zero-shot task planning with llm flow engineering for robotic instruction following.arXiv preprint arXiv:2503.02698, 2025

  39. [39]

    Procthor: Large-scale embodied ai using procedural generation, 2022

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation, 2022

  40. [40]

    Ai2-thor: An interactive 3d environment for visual ai, 2022

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2022

  41. [41]

    Visual language maps for robot navigation, 2023

    Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation, 2023

  42. [42]

    Audio visual language maps for robot navigation

    Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Audio visual language maps for robot navigation. InProceedings of the International Symposium on Experimental Robotics (ISER), Chiang Mai, Thailand, 2023

  43. [43]

    Interactive

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 12 Appendix A Comparison with Existing Embodied Benchm...