pith. machine review for the scientific record.

arxiv: 2605.14504 · v1 · submitted 2026-05-14 · 💻 cs.AI

Recognition: 2 theorem links


When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:55 UTC · model grok-4.3

classification: 💻 cs.AI
keywords: long-horizon planning · household robotics · embodied AI · VLM agent · benchmark · spatial memory · episodic memory · hierarchical planner

The pith

HoloMind agent with DAG planner and dual memories raises long-horizon household task success while cutting dependence on model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LongAct, a benchmark that tests agents on extended household chores described in free-form instructions by stripping away low-level robot controls. This isolates the need for instruction parsing, dependency tracking, world-state memory, and plan adaptation across dozens of steps. The authors build HoloMind around a directed-acyclic-graph hierarchical planner, a Multimodal Spatial Memory that keeps an updatable scene model, an Episodic Memory that reuses prior traces, and a global Critic that reviews progress. When paired with GPT-5 or Qwen3-VL, these modules lift goal completion and reduce the performance gap between small and large backbones. Yet even the strongest configurations reach only 59 percent goal completion and 16 percent full-task success, showing that sustained reasoning remains a bottleneck.
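To make the division of labor concrete, here is a minimal sketch of how a DAG-based hierarchical planner, a spatial memory, and a critic might be composed into one control loop. Every class and method name below (Subgoal, SpatialMemory, Critic.review, and the executor callable) is an illustrative assumption, not the paper's actual interface.

```python
# Hedged sketch of a HoloMind-style loop: a DAG of subgoals executed in
# dependency order, a spatial memory that absorbs observations, and a critic
# that decides whether a subgoal is really done. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Subgoal:
    name: str
    deps: set = field(default_factory=set)   # names of prerequisite subgoals

class SpatialMemory:
    """Compact, updatable picture of the scene: object id -> attributes."""
    def __init__(self):
        self.state = {}
    def update(self, obj, attrs):
        self.state.setdefault(obj, {}).update(attrs)

class Critic:
    """Reviews whether an executed subgoal achieved its intended effect."""
    def review(self, subgoal, memory):
        # Placeholder check; a real critic would query the VLM or simulator state.
        return memory.state.get(subgoal.name, {}).get("ok", True)

def run(plan, executor, memory, critic, max_retries=3):
    """Execute subgoals whose dependencies are complete; retry on critic failure."""
    pending = {g.name: g for g in plan}
    attempts = {name: 0 for name in pending}
    while pending:
        ready = [g for g in pending.values()
                 if all(d not in pending for d in g.deps)]
        if not ready:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
        for goal in ready:
            executor(goal, memory)            # grounded navigation/manipulation
            attempts[goal.name] += 1
            if critic.review(goal, memory):   # reflective supervision
                del pending[goal.name]
            elif attempts[goal.name] >= max_retries:
                raise RuntimeError(f"could not complete {goal.name}")
```

Even in this toy form the structure is visible: dependencies gate execution order, the memory absorbs observations instead of raw history, and the critic decides whether a subgoal can be retired or must be retried.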

Core claim

LongAct evaluates high-level autonomy in long-horizon household tasks by abstracting embodiment-specific control, and HoloMind improves results through a DAG-based hierarchical planner, Multimodal Spatial Memory for persistent world modeling, Episodic Memory for experience reuse, and a global Critic for reflective supervision, all while lowering reliance on base-model scale.

What carries the argument

DAG-based long-horizon hierarchical planner combined with Multimodal Spatial Memory, Episodic Memory, and global Critic.

If this is right

  • Structured memory and explicit planning let agents manage task dependencies across many steps without larger models.
  • Free-form instructions expose gaps in current VLMs that fixed-category benchmarks hide.
  • Full-task success stays low even when goal completion improves, indicating that plan repair and recovery remain unsolved.
  • The same architecture could be ported to other long-sequence domains once low-level control is reattached.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The abstraction may let researchers iterate on planning logic faster than full robot simulators allow.
  • If memory modules prove decisive, future agents could store compact scene graphs rather than raw video histories (see the sketch after this list).
  • Extending LongAct with cost or time metrics would test whether the planner also produces efficient sequences.
  • Hybrid systems that keep explicit memory may scale better than pure end-to-end VLMs for chores lasting tens of minutes.
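As a toy version of the compact-scene-graph idea raised in the list above, the sketch below stores objects as nodes with merged attributes and containment edges, and answers a simple "where is X" query. The schema is an assumption for illustration, not the paper's Multimodal Spatial Memory format.

```python
# Hedged sketch of a compact scene-graph memory: objects are nodes carrying
# state, edges record containment/support. Schema and queries are illustrative
# assumptions, not the paper's Multimodal Spatial Memory.
class SceneGraph:
    def __init__(self):
        self.nodes = {}    # object id -> attribute dict
        self.edges = {}    # object id -> id of its container / supporting surface

    def observe(self, obj_id, attrs, container=None):
        """Merge a new observation instead of appending raw video frames."""
        self.nodes.setdefault(obj_id, {}).update(attrs)
        if container is not None:
            self.edges[obj_id] = container

    def locate(self, obj_id):
        """Follow containment edges up to the top-level room or receptacle."""
        seen = set()
        while obj_id in self.edges and obj_id not in seen:
            seen.add(obj_id)
            obj_id = self.edges[obj_id]
        return obj_id

graph = SceneGraph()
graph.observe("kitchen_counter", {"type": "surface"}, container="kitchen")
graph.observe("bread_01", {"sliced": False}, container="kitchen_counter")
print(graph.locate("bread_01"))   # -> "kitchen"
```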

Load-bearing premise

High-level cognitive skills such as instruction understanding and adaptive planning can be measured accurately once low-level physical control is removed from the task.

What would settle it

A controlled test that adds realistic low-level actuation noise or partial observability to LongAct tasks and checks whether HoloMind's reported gains over baseline VLMs disappear.
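A hedged sketch of how such a test could be wired: wrap the simulator so that actions occasionally fail and observations are occasionally occluded, then rerun the same agent and compare goal completion against the clean setting. The environment interface and the failure probabilities below are hypothetical.

```python
import random

# Hedged sketch of an actuation-noise / partial-observability wrapper. The
# env interface (step returning an observation dict) and the failure rates
# are hypothetical; LongAct's real API may differ.
class NoisyEnv:
    def __init__(self, env, p_action_fail=0.1, p_occlude=0.2, seed=0):
        self.env = env
        self.p_action_fail = p_action_fail
        self.p_occlude = p_occlude
        self.rng = random.Random(seed)

    def step(self, action):
        if self.rng.random() < self.p_action_fail:
            action = {"type": "noop"}          # e.g. grasp slippage: action has no effect
        obs = self.env.step(action)
        if self.rng.random() < self.p_occlude:
            obs = {k: v for k, v in obs.items() if k != "visible_objects"}
        return obs

# Run the same agent on the clean and noisy variants and compare goal
# completion; if the reported gains vanish under noise, the abstraction
# was hiding real difficulty.
```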

Figures

Figures reproduced from arXiv: 2605.14504 by Bowen Pang, Jing Liu, Longteng Guo, Ruyi Ji, Xingjian He, Yanghong Mei, Zilin Zhu, Zongxun Zhang.

Figure 1. Overview of LongAct Bench. LongAct evaluates agents on long-horizon household tasks that span 500+ human steps and require tightly coupled navigation and manipulation across multi-room environments. The benchmark emphasizes persistent reasoning, memory, and error recovery over thousands of actions, revealing the challenges of achieving reliable long-horizon autonomy.
Figure 2. Overview of the LongAct benchmark construction pipeline. Each episode includes a multi-room house environment, a multi-goal long-horizon task, and a final-state checklist for evaluation.
Figure 3. Overview of the HoloMind framework. HoloMind consists of four modules: a two-layer Planner, an Executor, a Memory, and a Critic. The Planner takes the task T and, with support from Memory, incrementally decomposes it into executable natural-language instructions I_a. The Executor uses navigation and manipulation skill libraries to convert I_a and observations O_i into atomic simulator actions A.
Figure 4. A sample trajectory visualization of HoloMind on LongAct Bench. The agent decomposes the task into five goals executed through interleaved navigation and manipulation steps. When encountering an issue (highlighted in red), HoloMind analyzes the cause of failure and adjusts its strategy accordingly.
Figure 5. Comparison of human and agent scoring trajectories.
Figure 6. Error distribution across task planning, memory, and execution. This highlights that the most embodied forms of interaction, fine-grained manipulation, remain the most challenging aspect for an LLM-based agent.
Figure 7. Demonstration of critic-assisted correction during execution. When the agent fails to execute Slice bread, the Critic inspects the failure, identifies that the bread is not placed on a surface, and proposes a corrective action while updating its experience.
Figure 8. Demonstration of sequential object handling with persistent identity tracking. HoloMind is able to pick up one pen, correctly place it, and then pick up the other, maintaining a stable understanding that the two pens are distinct objects across the entire sequence of actions.
Figure 9. Demonstration of a failure case with unresolved errors. The agent incorrectly places the first book but fails to detect the mistake. When attempting to pick up the second book, the system becomes inconsistent, the Critic cannot recover, and control is returned to the planner, terminating the task.
Figure 10. Multimodal evidence used for memory retrieval. The LLM receives visual candidates from CLIP and determines whether each image matches the query description, enabling disambiguation in cluttered or visually similar scenes.
Figure 11. Power-law curves for IR values from 0 to 2. We numerically determine the exponent a such that s_t = (t/T)^a produces the desired IR value, yielding a continuous sweep of geometric shapes. (Axes: normalized step t/T vs. normalized score s_t; legend: Human IR=1.78, GPT-5 IR=1.70, GPT-5-mini IR=1.16, Qwen3-VL-32B IR=1.6….)
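The synthetic curves in Figure 11 follow the stated form s_t = (t/T)^a. The paper numerically solves for the exponent a that matches a target IR; since the IR definition is not reproduced on this page, the sketch below only generates curves for chosen exponents.

```python
import numpy as np

# Hedged sketch: the synthetic score curves of Figure 11 follow s_t = (t/T)^a.
# The paper solves for the exponent a that yields a target IR; the IR formula
# is not reproduced here, so we sweep a directly instead.
def power_law_curve(T, a):
    t = np.arange(1, T + 1)
    return (t / T) ** a

for a in (0.5, 1.0, 1.5, 2.0):
    curve = power_law_curve(T=100, a=a)
    print(f"a={a}: halfway score {curve[49]:.2f}, final score {curve[-1]:.2f}")
```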
Original abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LongAct, a benchmark for long-horizon household task execution that abstracts away embodiment-specific low-level control to isolate high-level capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. It also proposes HoloMind, a VLM-driven agent using a DAG-based hierarchical planner, Multimodal Spatial Memory, Episodic Memory, and a global Critic. Experiments with GPT-5 and Qwen3-VL models report that HoloMind yields substantial gains while reducing reliance on model scale, although absolute performance remains low (59% goal completion, 16% full-task success).

Significance. If the abstraction and results hold, the work addresses a clear gap in embodied AI by providing a benchmark focused on sustained planning rather than short-horizon navigation or manipulation, and demonstrates an agent architecture whose components (DAG planner, persistent multimodal memory, critic) can improve long-horizon performance without larger models. The low absolute success rates supply a concrete, falsifiable signal of remaining challenges in memory and adaptive re-planning.

major comments (3)
  1. [§3] §3 (Benchmark Design): The claim that abstracting low-level control successfully isolates high-level cognitive capabilities without losing essential task realism is load-bearing for the entire evaluation; the manuscript provides no analysis or ablation showing how perfect state observations affect cascading re-planning, dependency re-evaluation, or memory updates relative to realistic execution errors (grasp slippage, navigation drift, partial occlusions).
  2. [§5] §5 (Experiments and Results): The headline metrics (59% goal completion, 16% full-task success) are presented without reported variance across runs, number of trials per task, or statistical tests; this prevents verification that HoloMind's reported gains are robust rather than sensitive to particular task instances or random seeds.
  3. [§4.1] §4.1 (HoloMind Architecture): The global Critic's reflective supervision mechanism and its precise interface with the DAG planner and Multimodal Spatial Memory are described only at a high level; without explicit update rules or pseudocode, it is impossible to determine whether the reported improvements stem from the critic or from other unablated components.
minor comments (2)
  1. [Abstract] The abstract and §4 refer to 'GPT-5'; clarify the exact model version used (e.g., GPT-4o or a hypothetical successor) to avoid confusion.
  2. [Figure 3] Figure 3 (the HoloMind framework overview) would benefit from explicit arrows or labels showing data flow between the Critic, Episodic Memory, and DAG planner.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the benchmark design, experimental reporting, and architectural details. We address each major comment below and will revise the manuscript to incorporate additional analysis, statistical reporting, and implementation specifics where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Design): The claim that abstracting low-level control successfully isolates high-level cognitive capabilities without losing essential task realism is load-bearing for the entire evaluation; the manuscript provides no analysis or ablation showing how perfect state observations affect cascading re-planning, dependency re-evaluation, or memory updates relative to realistic execution errors (grasp slippage, navigation drift, partial occlusions).

    Authors: We agree that the abstraction's validity requires explicit validation against execution noise. In the revised manuscript we will add a dedicated ablation subsection that injects simulated errors (navigation drift, grasp failures, and partial occlusions) into the observation stream and quantifies their downstream effects on re-planning frequency, dependency re-evaluation, and memory-update accuracy. This will clarify the extent to which the reported performance gap is attributable to high-level reasoning versus low-level uncertainty. revision: yes

  2. Referee: [§5] §5 (Experiments and Results): The headline metrics (59% goal completion, 16% full-task success) are presented without reported variance across runs, number of trials per task, or statistical tests; this prevents verification that HoloMind's reported gains are robust rather than sensitive to particular task instances or random seeds.

    Authors: We acknowledge the omission of variance and statistical detail. The original experiments used five independent runs per task; we will report mean and standard deviation for all metrics, specify the exact trial count, and include paired t-tests (or Wilcoxon tests where appropriate) comparing HoloMind against baselines to establish statistical significance of the observed gains (a sketch of this reporting follows these responses). revision: yes

  3. Referee: [§4.1] §4.1 (HoloMind Architecture): The global Critic's reflective supervision mechanism and its precise interface with the DAG planner and Multimodal Spatial Memory are described only at a high level; without explicit update rules or pseudocode, it is impossible to determine whether the reported improvements stem from the critic or from other unablated components.

    Authors: We agree that the current description is insufficient for reproducibility. In the revision we will expand Section 4.1 with explicit update rules for the Critic, its read/write interface to the DAG planner and Multimodal Spatial Memory, and pseudocode in the appendix. We will also add an ablation that disables the Critic while keeping all other modules fixed, thereby isolating its contribution to the reported performance lift (a minimal ablation sketch also follows these responses). revision: yes
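As an editorial sketch of the statistical reporting promised in response 2: given per-task goal-completion scores from two systems on matched tasks, the mean, standard deviation, and a paired non-parametric test can be computed as below. The numbers are placeholders, not results from the paper.

```python
# Hedged sketch of the variance reporting and paired significance testing
# proposed in response 2. The scores below are placeholders, not paper results.
import numpy as np
from scipy import stats

# Per-task goal-completion rates (%), paired across the same task instances.
holomind = np.array([61.0, 55.2, 63.4, 58.7, 60.1])
baseline = np.array([44.3, 41.0, 47.9, 43.5, 45.8])

print(f"HoloMind: {holomind.mean():.1f} ± {holomind.std(ddof=1):.1f}")
print(f"Baseline: {baseline.mean():.1f} ± {baseline.std(ddof=1):.1f}")

# Wilcoxon signed-rank test on the paired differences (non-parametric
# alternative to the paired t-test when normality is doubtful).
stat, p = stats.wilcoxon(holomind, baseline)
print(f"Wilcoxon statistic={stat:.2f}, p={p:.4f}")
```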
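And a minimal sketch of the Critic ablation promised in response 3: the same execution path with the Critic's review optionally bypassed, so any change in performance can be attributed to that module alone. The interfaces are assumptions, not the paper's code (here the critic returns a structured verdict rather than the boolean used in the earlier sketch).

```python
# Hedged sketch of a Critic ablation: identical pipeline, with the Critic's
# reflective check optionally bypassed. Interfaces are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Verdict:
    success: bool
    corrective_subgoal: str = ""
    explanation: str = ""

def execute_subgoal(subgoal, executor, memory, critic=None, max_retries=3):
    """Run one subgoal; if a critic is supplied, let it verify and repair."""
    for _ in range(max_retries):
        executor.run(subgoal, memory)          # navigation/manipulation skills
        if critic is None:
            return True                        # ablated: trust the executor blindly
        verdict = critic.review(subgoal, memory)
        if verdict.success:
            return True
        # e.g. "place the bread on the counter before slicing"
        subgoal = verdict.corrective_subgoal
        memory.record_experience(subgoal, verdict.explanation)
    return False

# Ablation protocol: run the full benchmark once with critic=Critic() and once
# with critic=None, keeping every other module, prompt, and seed fixed.
```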

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces LongAct benchmark by abstracting low-level control to isolate high-level capabilities (instruction understanding, dependency management, memory, adaptive planning) and proposes HoloMind agent with DAG planner, Multimodal Spatial Memory, Episodic Memory, and global Critic. These are presented as independent design choices evaluated via experiments on external models (GPT-5, Qwen3-VL) yielding concrete metrics (59% goal completion, 16% full-task success). No equations, fitted parameters, self-citations, or uniqueness theorems are invoked in a load-bearing way; the derivation from benchmark definition to agent performance does not reduce to inputs by construction and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The work relies on standard assumptions from embodied AI and VLM literature; no free parameters or new physical entities are introduced. The memory modules and critic are architectural inventions without independent falsifiable evidence outside the paper.

invented entities (3)
  • Multimodal Spatial Memory (no independent evidence)
    purpose: persistent world modeling
    New component for maintaining scene state across long horizons
  • Episodic Memory (no independent evidence)
    purpose: experience reuse
    New component for storing and retrieving past successful episodes
  • global Critic (no independent evidence)
    purpose: reflective supervision
    New module for reviewing and improving plans

pith-pipeline@v0.9.0 · 5511 in / 1263 out tokens · 37951 ms · 2026-05-15T01:55:07.807981+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 11 internal anchors
