pith. machine review for the scientific record.

arxiv: 2302.01560 · v3 · submitted 2023-02-03 · 💻 cs.AI

Recognition: no theorem link

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords plan · planning · multi-task · open-world · agent · agents

The pith

DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Open-world environments like Minecraft require agents to carry out long sequences of actions for goals such as gathering resources or crafting items. Standard planners struggle because they produce rigid plans that correct errors poorly and do not order parallel sub-goals by how easily the current agent can achieve them. DEPS addresses this with four stages: the LLM generates an initial plan; during execution, a descriptor summarizes what happened; when a sub-goal fails, the LLM produces a self-explanation that diagnoses the failure and is fed back to revise the plan; and a trained selector ranks parallel sub-goals by estimated steps to completion. The selector is a separate trainable module that learns to predict sub-task difficulty. Experiments show this combination robustly completes more than 70 distinct Minecraft tasks in a zero-shot setting and also improves results in other simulated environments such as ALFWorld.
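The Describe-Explain-Plan part of that loop can be sketched in a few lines. This is a minimal, self-contained illustration, not the authors' code: the planner, explainer, and environment (`llm_plan`, `llm_explain`, `ToyEnv`) are hypothetical stubs, and the Select stage is omitted here for brevity.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    success: bool
    note: str = ""

class ToyEnv:
    """Minimal stand-in controller: 'mine log' only succeeds after 'find tree'."""
    def __init__(self):
        self.done = set()

    def execute(self, goal):
        if goal == "mine log" and "find tree" not in self.done:
            return Outcome(False, "no tree nearby")
        self.done.add(goal)
        return Outcome(True)

def llm_plan(task, feedback=None):
    # Stand-in planner: the initial plan omits a prerequisite; the revised
    # plan (produced after the Explain stage) adds it.
    if feedback is None:
        return ["mine log", "craft planks"]
    return ["find tree", "mine log", "craft planks"]

def llm_explain(description):
    # Stand-in self-explanation: diagnose the failure from the description.
    return f"failure diagnosed from: {description}"

def deps_episode(task, env, max_rounds=3):
    """One Describe-Explain-Plan episode over a toy environment."""
    plan = llm_plan(task)
    for _ in range(max_rounds):
        for goal in plan:
            if goal in env.done:
                continue
            outcome = env.execute(goal)
            if not outcome.success:
                # Describe: summarize what happened during execution.
                description = f"while doing '{goal}': {outcome.note}"
                # Explain: let the LLM diagnose the failure from the description,
                explanation = llm_explain(description)
                # then re-Plan with that diagnosis as feedback.
                plan = llm_plan(task, feedback=explanation)
                break
        else:
            return True  # every sub-goal completed
    return False
```

On this toy instance, the initial plan fails at "mine log", the explanation triggers a replan that inserts "find tree", and the second round completes the task.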

Core claim

Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances.

Load-bearing premise

That large language models can reliably produce accurate descriptions and self-explanations of plan failures without hallucination, and that the trained goal selector generalizes to estimate completion steps across diverse tasks.

read the original abstract

We investigate the challenge of task planning for multi-task embodied agents in open-world environments. Two main difficulties are identified: 1) executing plans in an open-world environment (e.g., Minecraft) necessitates accurate and multi-step reasoning due to the long-term nature of tasks, and 2) as vanilla planners do not consider how easy the current agent can achieve a given sub-task when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient or even infeasible. To this end, we propose "$\underline{D}$escribe, $\underline{E}$xplain, $\underline{P}$lan and $\underline{S}$elect" ($\textbf{DEPS}$), an interactive planning approach based on Large Language Models (LLMs). DEPS facilitates better error correction on initial LLM-generated $\textit{plan}$ by integrating $\textit{description}$ of the plan execution process and providing self-$\textit{explanation}$ of feedback when encountering failures during the extended planning phases. Furthermore, it includes a goal $\textit{selector}$, which is a trainable module that ranks parallel candidate sub-goals based on the estimated steps of completion, consequently refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the $\texttt{ObtainDiamond}$ grand challenge with our approach. The code is released at https://github.com/CraftJarvis/MC-Planner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs provide reliable planning feedback and that a separately trained selector can estimate sub-task difficulty; no new physical entities are introduced.

free parameters (1)
  • goal selector parameters
    The trainable module that ranks sub-goals by estimated completion steps is fitted during training.
axioms (1)
  • domain assumption: Large language models can generate accurate descriptions of plan execution and self-explanations of failures
    Invoked throughout the interactive planning loop described in the abstract.
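The ledger's single trainable piece, the goal selector, can be sketched as a small regressor that predicts steps-to-complete and sorts candidate sub-goals easiest-first. Everything here is illustrative under stated assumptions: the feature function and plain-SGD fit are hypothetical stand-ins, not the paper's learned neural predictor over agent observations.

```python
def features(state, goal):
    # Hypothetical features: a distance-like reading from a toy state dict,
    # plus a bias term. The paper's selector learns its own features instead.
    return [state.get(goal, 10.0), 1.0]

class GoalSelector:
    """Rank parallel sub-goals by predicted steps-to-complete (easiest first)."""

    def __init__(self, dim=2):
        self.w = [0.0] * dim

    def predict_steps(self, state, goal):
        # Linear estimate of how many steps this sub-goal would take.
        return sum(wi * xi for wi, xi in zip(self.w, features(state, goal)))

    def fit(self, data, lr=0.01, epochs=500):
        # data: (state, goal, observed_steps) triples; plain SGD on squared error.
        for _ in range(epochs):
            for state, goal, steps in data:
                x = features(state, goal)
                err = self.predict_steps(state, goal) - steps
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]

    def rank(self, state, goals):
        # The Select stage: reorder candidate sub-goals by estimated difficulty.
        return sorted(goals, key=lambda g: self.predict_steps(state, g))
```

Trained on toy triples where observed steps grow with distance, the selector places the nearer resource first, which is exactly the plan-refinement role the ledger describes.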

pith-pipeline@v0.9.0 · 5637 in / 1223 out tokens · 218447 ms · 2026-05-16T03:20:17.333742+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  2. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  3. Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

    cs.AI 2026-05 unverdicted novelty 7.0

    VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

  4. World2Minecraft: Occupancy-Driven Simulated Scenes Construction

    cs.CV 2026-04 unverdicted novelty 7.0

    World2Minecraft turns real scenes into Minecraft worlds via occupancy prediction and releases a large indoor occupancy dataset to improve such models.

  5. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI 2026-04 unverdicted novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  6. GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

    cs.AI 2026-04 unverdicted novelty 7.0

    GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

  7. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  8. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  9. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  10. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  11. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  12. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  13. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  14. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  15. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  16. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    cs.AI 2023-05 conditional novelty 6.0

    GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.

  17. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  18. Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game

    cs.MA 2026-04 unverdicted novelty 5.0

    Gated escalation and partitioned states enable more efficient multi-agent collaboration in Minecraft by making communication selective rather than automatic.

  19. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  20. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 19 Pith papers · 16 internal anchors
