pith. sign in

hub Canonical reference

Inner Monologue: Embodied Reasoning through Planning with Language Models

Canonical reference. 81% of citing Pith papers cite this work as background.

75 Pith papers citing it
Background 81% of classified citations
abstract

Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.

hub tools

citation-role summary

background 23 baseline 1 dataset 1 extension 1

citation-polarity summary

claims ledger

  • abstract Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choice

co-cited works

clear filters

representative citing papers

State-Centric Decision Process

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

citing papers explorer

Showing 10 of 10 citing papers after filters.

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models cs.CL · 2023-05-17 · accept · none · ref 14 · internal anchor

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  • Generative Agents: Interactive Simulacra of Human Behavior cs.HC · 2023-04-07 · accept · none · ref 51 · internal anchor

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  • Code as Policies: Language Model Programs for Embodied Control cs.RO · 2022-09-16 · accept · none · ref 18 · internal anchor

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  • From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 63 · internal anchor

    A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

  • LLM+P: Empowering Large Language Models with Optimal Planning Proficiency cs.AI · 2023-04-22 · accept · none · ref 37 · internal anchor

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances cs.RO · 2022-04-04 · accept · none · ref 25 · internal anchor

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  • Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success cs.RO · 2025-02-27 · accept · none · ref 17 · internal anchor

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  • Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 33 · internal anchor

    CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

  • A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 62 · internal anchor

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 100 · internal anchor

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.