{"work":{"id":"f6e5e4a1-e34b-4602-a7ad-df0c6103a4d0","openalex_id":null,"doi":null,"arxiv_id":"2207.05608","raw_key":null,"title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","authors":null,"authors_text":"Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence","year":2022,"venue":"cs.RO","abstract":"Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.","external_url":"https://arxiv.org/abs/2207.05608","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T16:23:39.085084+00:00","pith_arxiv_id":"2207.05608","created_at":"2026-05-08T18:28:58.388919+00:00","updated_at":"2026-06-29T16:23:39.085084+00:00","title_quality_ok":true,"display_title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","render_title":"Inner Monologue: Embodied Reasoning through Planning with Language Models"},"hub":{"state":{"work_id":"f6e5e4a1-e34b-4602-a7ad-df0c6103a4d0","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":75,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2022-04-04T17:57:11+00:00","last_pith_cited_at":"2026-05-29T18:16:51+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T18:19:06.100995+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":23},{"context_role":"baseline","n":1},{"context_role":"dataset","n":1},{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"background","n":21},{"context_polarity":"baseline","n":1},{"context_polarity":"extend","n":1},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:10:15.633864+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":13},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":9},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":7},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":7},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":6},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":6},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"Liang, W","work_id":"2d96fa84-8097-493c-b5b8-379c984a5047","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":5},{"title":"AI2-THOR: An Interactive 3D Environment for Visual AI","work_id":"9c86ed28-ea70-424c-bd56-34f59dcad861","shared_citers":4},{"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","shared_citers":4},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":4},{"title":"Language models as zero-shot planners: Extracting actionable knowledge for embodied agents","work_id":"e2e590a8-7af9-4ab9-b5be-29c703ef90fb","shared_citers":4},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":4},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":4},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":4},{"title":"Singh, V","work_id":"3b97ea38-e7d6-4a6b-9857-ef7e271dc3d2","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"The Rise and Potential of Large Language Model Based Agents: A Survey","work_id":"985ca219-7e34-4c4f-bdc5-ccd39763ad61","shared_citers":4},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":4}],"time_series":[{"n":1,"year":2022},{"n":6,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":26,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:10:11.869441+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:10:11.832387+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","claims":[{"claim_text":"Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choice","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Inner Monologue: Embodied Reasoning through Planning with Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:09:31.000194+00:00"}},"summary":{"title":"Inner Monologue: Embodied Reasoning through Planning with Language Models","claims":[{"claim_text":"Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choice","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Inner Monologue: Embodied Reasoning through Planning with Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","shared_citers":13},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":9},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":7},{"title":"PaLM-E: An Embodied Multimodal Language Model","work_id":"5b99811a-1d93-47e2-9d59-f4045a0b74a2","shared_citers":7},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":6},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":6},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":6},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":6},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"Liang, W","work_id":"2d96fa84-8097-493c-b5b8-379c984a5047","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":5},{"title":"AI2-THOR: An Interactive 3D Environment for Visual AI","work_id":"9c86ed28-ea70-424c-bd56-34f59dcad861","shared_citers":4},{"title":"Emergent Abilities of Large Language Models","work_id":"6ea3375b-837c-4640-a175-be7525aa3c6d","shared_citers":4},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":4},{"title":"Language models as zero-shot planners: Extracting actionable knowledge for embodied agents","work_id":"e2e590a8-7af9-4ab9-b5be-29c703ef90fb","shared_citers":4},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":4},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":4},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":4},{"title":"Singh, V","work_id":"3b97ea38-e7d6-4a6b-9857-ef7e271dc3d2","shared_citers":4},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":4},{"title":"The Rise and Potential of Large Language Model Based Agents: A Survey","work_id":"985ca219-7e34-4c4f-bdc5-ccd39763ad61","shared_citers":4},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":4}],"time_series":[{"n":1,"year":2022},{"n":6,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":26,"year":2026}],"dependency_candidates":[]},"authors":[]}}