pith. machine review for the scientific record.

arxiv: 2302.01560 · v3 · submitted 2023-02-03 · 💻 cs.AI

Recognition: no theorem link

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords plan · planning · multi-task · open-world · agent · agents

The pith

DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Open-world environments like Minecraft require agents to carry out long sequences of actions for goals such as gathering resources or crafting items. Standard planners struggle because they produce rigid plans that correct errors poorly and do not order parallel sub-goals by how easily the current agent can achieve them. DEPS addresses this with four stages: the LLM generates an initial plan; during execution, a descriptor summarizes what happened; when a sub-goal fails, the LLM produces a self-explanation that diagnoses the failure and is fed back to revise the plan; and a trained selector ranks parallel sub-goals by estimated steps to completion. The selector is a separate trainable module that learns to predict sub-task difficulty. Experiments show this combination robustly completes more than 70 distinct Minecraft tasks in a zero-shot setting and also improves results in other simulated environments such as ALFWorld.
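The Describe-Explain-Plan part of that loop can be sketched in a few lines. This is a minimal, self-contained illustration, not the authors' code: the planner, explainer, and environment (`llm_plan`, `llm_explain`, `ToyEnv`) are hypothetical stubs, and the Select stage is omitted here for brevity.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    success: bool
    note: str = ""

class ToyEnv:
    """Minimal stand-in controller: 'mine log' only succeeds after 'find tree'."""
    def __init__(self):
        self.done = set()

    def execute(self, goal):
        if goal == "mine log" and "find tree" not in self.done:
            return Outcome(False, "no tree nearby")
        self.done.add(goal)
        return Outcome(True)

def llm_plan(task, feedback=None):
    # Stand-in planner: the initial plan omits a prerequisite; the revised
    # plan (produced after the Explain stage) adds it.
    if feedback is None:
        return ["mine log", "craft planks"]
    return ["find tree", "mine log", "craft planks"]

def llm_explain(description):
    # Stand-in self-explanation: diagnose the failure from the description.
    return f"failure diagnosed from: {description}"

def deps_episode(task, env, max_rounds=3):
    """One Describe-Explain-Plan episode over a toy environment."""
    plan = llm_plan(task)
    for _ in range(max_rounds):
        for goal in plan:
            if goal in env.done:
                continue
            outcome = env.execute(goal)
            if not outcome.success:
                # Describe: summarize what happened during execution.
                description = f"while doing '{goal}': {outcome.note}"
                # Explain: let the LLM diagnose the failure from the description,
                explanation = llm_explain(description)
                # then re-Plan with that diagnosis as feedback.
                plan = llm_plan(task, feedback=explanation)
                break
        else:
            return True  # every sub-goal completed
    return False
```

On this toy instance, the initial plan fails at "mine log", the explanation triggers a replan that inserts "find tree", and the second round completes the task.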

Core claim

Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances.

Load-bearing premise

That large language models can reliably produce accurate descriptions and self-explanations of plan failures without hallucination, and that the trained goal selector generalizes to estimate completion steps across diverse tasks.

read the original abstract

We investigate the challenge of task planning for multi-task embodied agents in open-world environments. Two main difficulties are identified: 1) executing plans in an open-world environment (e.g., Minecraft) necessitates accurate and multi-step reasoning due to the long-term nature of tasks, and 2) as vanilla planners do not consider how easy the current agent can achieve a given sub-task when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient or even infeasible. To this end, we propose "$\underline{D}$escribe, $\underline{E}$xplain, $\underline{P}$lan and $\underline{S}$elect" ($\textbf{DEPS}$), an interactive planning approach based on Large Language Models (LLMs). DEPS facilitates better error correction on initial LLM-generated $\textit{plan}$ by integrating $\textit{description}$ of the plan execution process and providing self-$\textit{explanation}$ of feedback when encountering failures during the extended planning phases. Furthermore, it includes a goal $\textit{selector}$, which is a trainable module that ranks parallel candidate sub-goals based on the estimated steps of completion, consequently refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the $\texttt{ObtainDiamond}$ grand challenge with our approach. The code is released at https://github.com/CraftJarvis/MC-Planner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs provide reliable planning feedback and that a separately trained selector can estimate sub-task difficulty; no new physical entities are introduced.

free parameters (1)
  • goal selector parameters
    The trainable module that ranks sub-goals by estimated completion steps is fitted during training.
axioms (1)
  • domain assumption: Large language models can generate accurate descriptions of plan execution and self-explanations of failures
    Invoked throughout the interactive planning loop described in the abstract.
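The ledger's single trainable piece, the goal selector, can be sketched as a small regressor that predicts steps-to-complete and sorts candidate sub-goals easiest-first. Everything here is illustrative under stated assumptions: the feature function and plain-SGD fit are hypothetical stand-ins, not the paper's learned neural predictor over agent observations.

```python
def features(state, goal):
    # Hypothetical features: a distance-like reading from a toy state dict,
    # plus a bias term. The paper's selector learns its own features instead.
    return [state.get(goal, 10.0), 1.0]

class GoalSelector:
    """Rank parallel sub-goals by predicted steps-to-complete (easiest first)."""

    def __init__(self, dim=2):
        self.w = [0.0] * dim

    def predict_steps(self, state, goal):
        # Linear estimate of how many steps this sub-goal would take.
        return sum(wi * xi for wi, xi in zip(self.w, features(state, goal)))

    def fit(self, data, lr=0.01, epochs=500):
        # data: (state, goal, observed_steps) triples; plain SGD on squared error.
        for _ in range(epochs):
            for state, goal, steps in data:
                x = features(state, goal)
                err = self.predict_steps(state, goal) - steps
                self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]

    def rank(self, state, goals):
        # The Select stage: reorder candidate sub-goals by estimated difficulty.
        return sorted(goals, key=lambda g: self.predict_steps(state, g))
```

Trained on toy triples where observed steps grow with distance, the selector places the nearer resource first, which is exactly the plan-refinement role the ledger describes.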

pith-pipeline@v0.9.0 · 5637 in / 1223 out tokens · 218447 ms · 2026-05-16T03:20:17.333742+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  2. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  3. Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

    cs.AI 2026-05 unverdicted novelty 7.0

    VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

  4. World2Minecraft: Occupancy-Driven Simulated Scenes Construction

    cs.CV 2026-04 unverdicted novelty 7.0

    World2Minecraft turns real scenes into Minecraft worlds via occupancy prediction and releases a large indoor occupancy dataset to improve such models.

  5. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI 2026-04 unverdicted novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  6. GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

    cs.AI 2026-04 unverdicted novelty 7.0

    GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

  7. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  8. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  9. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  10. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  11. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  12. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  13. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  14. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  15. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  16. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    cs.AI 2023-05 conditional novelty 6.0

    GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.

  17. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  18. Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game

    cs.MA 2026-04 unverdicted novelty 5.0

    Gated escalation and partitioned states enable more efficient multi-agent collaboration in Minecraft by making communication selective rather than automatic.

  19. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  20. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 19 Pith papers · 16 internal anchors
