Code as Policies: Language Model Programs for Embodied Control
Pith reviewed 2026-05-15 00:34 UTC · model grok-4.3
The pith
Language models write executable robot policies by composing code from a few example commands.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a few-shot prompt of example language commands (formatted as comments) paired with corresponding policy code, LLMs can take in new commands and autonomously re-compose API calls to generate new policy code that exhibits spatial-geometric reasoning, generalizes to new instructions, and prescribes precise values for ambiguous descriptions depending on context.
What carries the argument
Hierarchical code generation through recursive prompting, where the model defines undefined functions on the fly to build complex policies that process perception outputs and parameterize control primitives.
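The recursive pattern can be sketched with a stub in place of the model: parse the generated code, find calls to functions that nothing has defined yet, and prompt again for each one. A minimal sketch, assuming a dictionary `FAKE_LLM` as a hypothetical stand-in for the code-writing model; the helper names echo the paper's prompt style:

```python
import ast
import builtins

def undefined_calls(code, known):
    """Function names called in `code` that are neither defined in it,
    already known, nor Python builtins."""
    tree = ast.parse(code)
    defs = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    calls = {n.func.id for n in ast.walk(tree)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    return calls - defs - known - set(dir(builtins))

def hierarchical_codegen(code, llm, known=None):
    """Recursively prompt `llm` (here just a name -> source lookup) to
    define every function the code calls but no one has defined yet,
    prepending the definitions so the assembled program runs."""
    known = known if known is not None else set()
    pieces = []
    for name in sorted(undefined_calls(code, known)):
        known.add(name)
        pieces.append(hierarchical_codegen(llm(name), llm, known))
    pieces.append(code)
    return "\n".join(pieces)

# Hypothetical stand-in for the code-writing LLM: maps a requested
# function name to source, as few-shot prompting would in the real system.
FAKE_LLM = {
    "stack_objects": ("def stack_objects(objs):\n"
                      "    for top, bottom in zip(objs, objs[1:]):\n"
                      "        put_first_on_second(top, bottom)\n"),
    "put_first_on_second": ("def put_first_on_second(a, b):\n"
                            "    actions.append((a, b))\n"),
}

task = "# stack the blocks\nstack_objects(['red', 'green', 'blue'])\n"
program = hierarchical_codegen(task, FAKE_LLM.__getitem__)
actions = []
exec(program, {"actions": actions})
print(actions)  # [('red', 'green'), ('green', 'blue')]
```

The shared `known` set prevents a helper from being requested twice and breaks mutual-recursion cycles, which is the load-bearing detail of the recursive scheme.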
If this is right
- Policies gain spatial-geometric reasoning by chaining classic logic structures and referencing third-party libraries such as NumPy and Shapely to perform arithmetic.
- Generated policies generalize to new instructions without additional training or fine-tuning.
- Vague language like 'faster' is turned into concrete parameter values using behavioral commonsense encoded in the model.
- The same prompting approach improves the state of the art on the HumanEval code benchmark to solving 39.8 percent of its problems.
- The formulation supports both reactive policies such as impedance controllers and waypoint-based policies such as pick-and-place.
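As one concrete instance of the waypoint-based case, here is a sketch of the kind of policy code such prompting produces, in the style of the paper's appendix prompts (`get_pos`, `put_first_on_second`); the object position and the 0.3 offset for "toward the left" are illustrative stand-ins, not values from the paper:

```python
import numpy as np

# Toy perception and control APIs; a real stack would back get_pos with
# an object detector and put_first_on_second with a pick-and-place primitive.
POSITIONS = {"purple bowl": np.array([0.2, 0.5])}
ACTIONS = []

def get_pos(name):
    return POSITIONS[name]

def put_first_on_second(obj, target):
    ACTIONS.append((obj, np.asarray(target, dtype=float)))

# objs = ['purple block', 'purple bowl']
# move the purple bowl toward the left.   <- command given as a comment
target_pos = get_pos("purple bowl") + [-0.3, 0]  # NumPy does the arithmetic
put_first_on_second("purple bowl", target_pos)

print(ACTIONS[0][0], ACTIONS[0][1].round(2).tolist())  # purple bowl [-0.1, 0.5]
```

The point of the sketch is that the spatial reasoning lives in ordinary array arithmetic, not in the model itself: the LLM only has to compose the right calls.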
Where Pith is reading between the lines
- This method could allow rapid adaptation of robot behavior across different hardware by swapping only the low-level API definitions while keeping the high-level prompt structure fixed.
- Safety-critical applications would likely require an added runtime monitor layer because the paper's core claim assumes flawless first-try execution.
- Extending the recursive function definition pattern to multi-robot coordination or long-horizon tasks remains an open direction not tested in the current experiments.
Load-bearing premise
The code produced by the language model will execute correctly and safely on physical robots for novel commands without runtime errors or the need for extra verification.
What would settle it
Running the model on a new instruction such as 'move the mug faster toward the target while avoiding the obstacle' and observing whether the generated code completes the motion safely on the robot, or instead crashes, produces unsafe velocities, or fails to finish.
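One inexpensive way to probe that premise is a runtime monitor wrapped around whatever the model generates. The guard below is a hypothetical sketch, not part of the paper's method, and the 0.5 m/s limit is invented for illustration:

```python
MAX_SPEED = 0.5  # m/s; an illustrative hardware limit, not from the paper

class UnsafeCommand(Exception):
    """Raised when generated policy code requests an out-of-limit motion."""

def checked_velocity(vx, vy):
    """Pass a velocity through to the controller only if its magnitude
    is within limits; otherwise refuse and surface the failure."""
    speed = (vx ** 2 + vy ** 2) ** 0.5
    if speed > MAX_SPEED:
        raise UnsafeCommand(f"requested speed {speed:.2f} m/s exceeds {MAX_SPEED} m/s")
    return vx, vy

print(checked_velocity(0.3, 0.2))  # within limits, passes through
try:
    checked_velocity(0.6, 0.4)  # 'faster' interpreted too aggressively
except UnsafeCommand as e:
    print(e)
```

A monitor of this shape would turn the open question above from "does the code ever misbehave" into a logged, countable event stream.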
Original abstract
Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster") depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes 'Code as Policies,' a framework in which LLMs trained on code completion are repurposed via few-shot prompting to synthesize executable Python robot policies from natural-language commands. Examples consist of language instructions formatted as comments paired with corresponding policy code that calls perception APIs, control primitives, and third-party libraries (NumPy, Shapely) for geometric reasoning and arithmetic. Hierarchical prompting (recursively defining undefined functions) is introduced to generate more complex policies. The approach is claimed to produce policies exhibiting spatial reasoning, generalization to novel instructions, and context-dependent parameter assignment, with demonstrations on multiple real robot platforms and an improvement to 39.8% on the HumanEval benchmark.
Significance. If the empirical claims are substantiated with quantitative robot-task metrics, the work would be significant for bridging LLMs and robotics by offering an interpretable, code-based mechanism for policy generation that supports generalization and commonsense without task-specific fine-tuning. The hierarchical code-generation technique also contributes to LLM program synthesis.
major comments (2)
- [Experimental Evaluation] The central claim that few-shot LLM-generated policies execute correctly and generalize on physical robots for novel commands is load-bearing yet supported only by qualitative success cases and videos. No success rates, trial counts, failure-mode analysis, or ablation studies over a held-out set of novel commands are reported in the experimental evaluation, leaving open the possibility that observed behaviors reflect prompt curation rather than reliable autonomous synthesis.
- [Real-Robot Demonstrations] The manuscript asserts that generated policies 'prescribe precise values to ambiguous descriptions' and execute safely on hardware, but provides no runtime verification, error-handling analysis, or discussion of failure modes (e.g., API misuse, unsafe velocities) that would be required to substantiate deployment claims.
minor comments (2)
- [Abstract] The abstract states an improvement 'to 39.8%' on HumanEval without clarifying the prior state-of-the-art baseline or the exact prompting setup used for that number.
- [Approach] Notation for policy code structure (e.g., how perception outputs are typed and passed to control primitives) is introduced informally; a short pseudocode template or explicit API signature table would improve clarity.
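A stub template along these lines would address the minor comment; the signatures below are sketched from API names that appear in the paper's prompts (`parse_obj`, `get_pos`, `put_first_on_second`), with the type annotations being our guess rather than the paper's:

```python
from typing import Tuple, Union

Vec2 = Tuple[float, float]  # 2D tabletop position

def parse_obj(description: str) -> str:
    """Perception: resolve a language description to a detected object name."""
    raise NotImplementedError

def get_pos(obj_name: str) -> Vec2:
    """Perception: current position of a named object."""
    raise NotImplementedError

def put_first_on_second(obj_name: str, target: Union[str, Vec2]) -> None:
    """Control primitive: pick `obj_name`, place on an object or at a position."""
    raise NotImplementedError
```

Making these signatures explicit would also pin down the typing question the referee raises: perception returns names and positions, and control primitives accept either.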
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to include quantitative metrics and expanded analysis of real-robot execution.
Point-by-point responses
Referee: [Experimental Evaluation] The central claim that few-shot LLM-generated policies execute correctly and generalize on physical robots for novel commands is load-bearing yet supported only by qualitative success cases and videos. No success rates, trial counts, failure-mode analysis, or ablation studies over a held-out set of novel commands are reported in the experimental evaluation, leaving open the possibility that observed behaviors reflect prompt curation rather than reliable autonomous synthesis.
Authors: We agree that quantitative evaluation is important for substantiating the central claims. In the revised manuscript we have added a dedicated subsection to the experimental evaluation reporting success rates, trial counts, and failure-mode analysis over a held-out set of novel commands. We also include ablation studies comparing prompting variants to address concerns about prompt curation. revision: yes
Referee: [Real-Robot Demonstrations] The manuscript asserts that generated policies 'prescribe precise values to ambiguous descriptions' and execute safely on hardware, but provides no runtime verification, error-handling analysis, or discussion of failure modes (e.g., API misuse, unsafe velocities) that would be required to substantiate deployment claims.
Authors: We acknowledge that the original manuscript provided limited discussion of these practical aspects. The revision adds an expanded analysis of runtime verification, error-handling mechanisms in the generated policies, and explicit discussion of failure modes including API misuse and unsafe velocities, supported by examples from the robot experiments. revision: yes
Circularity Check
No significant circularity; empirical demonstration of LLM code generation for policies
full rationale
The manuscript presents an empirical technique for repurposing code-trained LLMs via few-shot prompting to synthesize robot policies. No mathematical derivation chain, equations, or fitted parameters exist that reduce outputs to inputs by construction. Claims rest on curated demonstrations, hierarchical prompting, and an external benchmark result (HumanEval), with no load-bearing self-citations or self-definitional steps. The approach applies known prompting methods to a new domain without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models trained on code completion can synthesize simple Python programs from docstrings.
Forward citations
Cited by 23 Pith papers
- BOOKMARKS: Efficient Active Storyline Memory for Role-playing. BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
- Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts. Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.
- Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion. Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
- Atomic-Probe Governance for Skill Updates in Compositional Robot Policies. A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
- Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control. GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- Voyager: An Open-Ended Embodied Agent with Large Language Models. Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
- Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models. VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
- From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation. AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...
- Atomic-Probe Governance for Skill Updates in Compositional Robot Policies. Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
- Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems. Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
- Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction. COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...
- ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models. ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
- A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring. A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...
- RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains. RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents. The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
- PaLM-E: An Embodied Multimodal Language Model. PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
- ORICF -- Open Robotics Inference and Control Framework. ORICF is a declarative, model-agnostic robotics framework with YAML specs and edge offloading that reduces robot compute utilization by up to 83% and energy by 66% in a ROS2 demo combining ASR, LLM, and CNN.
- Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents. ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.
- Environmental Understanding Vision-Language Model for Embodied Agent. EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering. LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
- Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding. Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
Reference graph
Works this paper leans on
- [1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv:2107.03374, 2021.
- [2] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, "MDETR - modulated detection for end-to-end multi-modal understanding," in ICCV, 2021.
- [3] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, "Open-vocabulary object detection via vision and language knowledge distillation," arXiv:2104.13921, 2021.
- [4] S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, "Robots that use language," Annual Review of Control, Robotics, and Autonomous Systems, 2020.
- [5] T. Winograd, "Procedures as a representation for data in a computer program for understanding natural language," MIT Project MAC, 1971.
- [6] J. Dzifcak, M. Scheutz, C. Baral, and P. Schermerhorn, "What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution," in ICRA, 2009.
- [7] Y. Artzi and L. Zettlemoyer, "Weakly supervised learning of semantic parsers for mapping instructions to actions," TACL, 2013.
- [8] C. Lynch and P. Sermanet, "Language conditioned imitation learning over unstructured data," arXiv:2005.07648, 2020.
- [9] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, "BC-Z: Zero-shot task generalization with robotic imitation learning," in CoRL, 2022.
- [10] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, "CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks," RA-L, 2022.
- [11] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "PaLM: Scaling language modeling with pathways," arXiv:2204.02311, 2022.
- [12] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," NeurIPS, 2020.
- [13] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., "OPT: Open pre-trained transformer language models," arXiv:2205.01068, 2022.
- [14] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents," arXiv:2201.07207, 2022.
- [15] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," arXiv:2205.11916, 2022.
- [16] A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke et al., "Socratic models: Composing zero-shot multimodal reasoning with language," arXiv:2204.00598, 2022.
- [17] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog et al., "Do as I can, not as I say: Grounding language in robotic affordances," arXiv:2204.01691, 2022.
- [18] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, "Inner monologue: Embodied reasoning through planning with language models," arXiv:2207.05608, 2022.
- [19] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, "Implicit behavioral cloning," in CoRL, 2022.
- [20] A. Zeng, "Learning visual affordances for robotic manipulation," Ph.D. dissertation, Princeton University, 2019.
- [21] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., "Scalable deep reinforcement learning for vision-based robotic manipulation," in CoRL, 2018.
- [22] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," arXiv:2203.02155, 2022.
- [23] D. Hupkes, V. Dankers, M. Mul, and E. Bruni, "Compositionality decomposed: How do neural networks generalise?" JAIR, 2020.
- [24] C. Breazeal, K. Dautenhahn, and T. Kanda, "Social robotics," Springer Handbook of Robotics, 2016.
- [25] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Toward understanding natural language directions," in HRI, 2010.
- [26] J. Luketina, N. Nardelli, G. Farquhar, J. N. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel, "A survey of reinforcement learning informed by natural language," in IJCAI, 2019.
- [27] M. MacMahon, B. Stankiewicz, and B. Kuipers, "Walk the talk: Connecting language, knowledge, and action in route instructions," AAAI, 2006.
- [28] J. Thomason, S. Zhang, R. J. Mooney, and P. Stone, "Learning to interpret natural language commands through human-robot dialog," in IJCAI, 2015.
- [29] S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy, "Understanding natural language commands for robotic navigation and mobile manipulation," in AAAI, 2011.
- [30] D. Shah, B. Osinski, B. Ichter, and S. Levine, "LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action," arXiv:2207.04429, 2022.
- [31] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox, "Learning to parse natural language commands to a robot control system," in Experimental Robotics, 2013.
- [32] J. Thomason, A. Padmakumar, J. Sinapov, N. Walker, Y. Jiang, H. Yedidsion, J. Hart, P. Stone, and R. Mooney, "Jointly improving parsing and perception for natural language commands through human-robot dialog," JAIR, 2020.
- [33] S. Nair, E. Mitchell, K. Chen, S. Savarese, C. Finn et al., "Learning language-conditioned robot behavior from offline data and crowd-sourced annotation," in CoRL, 2022.
- [34] J. Andreas, D. Klein, and S. Levine, "Learning with latent language," arXiv:1711.00482, 2017.
- [35] P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox, "Correcting robot plans with natural language feedback," arXiv:2204.05186, 2022.
- [36] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in CoRL, 2021.
- [37] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor, "Language-conditioned imitation learning for robot manipulation tasks," NeurIPS, 2020.
- [38] Y. Jiang, S. S. Gu, K. P. Murphy, and C. Finn, "Language as an abstraction for hierarchical deep reinforcement learning," NeurIPS, 2019.
- [39] P. Goyal, S. Niekum, and R. J. Mooney, "PixL2R: Guiding reinforcement learning using natural language by mapping pixels to rewards," arXiv:2007.15543, 2020.
- [40] G. Cideron, M. Seurin, F. Strub, and O. Pietquin, "Self-educated language agent with hindsight experience replay for instruction following," DeepMind, 2019.
- [41] D. Misra, J. Langford, and Y. Artzi, "Mapping instructions and visual observations to actions with reinforcement learning," arXiv:1704.08795, 2017.
- [42] A. Akakzia, C. Colas, P.-Y. Oudeyer, M. Chetouani, and O. Sigaud, "Grounding language to autonomously-acquired skills via goal generation," arXiv:2006.07185, 2020.
- [43] I. Drori, S. Zhang, R. Shuttleworth, L. Tang, A. Lu, E. Ke, K. Liu, L. Chen, S. Tran, N. Cheng et al., "A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level," PNAS, 2022.
- [44] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al., "Solving quantitative reasoning problems with language models," arXiv:2206.14858, 2022.
- [45] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, "Training verifiers to solve math word problems," arXiv:2110.14168, 2021.
- [46] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and E. Chi, "Least-to-most prompting enables complex reasoning in large language models," arXiv:2205.10625, 2022.
- [47] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," arXiv:2201.11903, 2022.
- [48] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., "Program synthesis with large language models," arXiv:2108.07732, 2021.
- [49] K. Ellis, C. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum, "DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning," arXiv:2006.08381, 2020.
- [50] L. Tian, K. Ellis, M. Kryven, and J. Tenenbaum, "Learning abstract structure for drawing by efficient motor program induction," NeurIPS, 2020.
- [51] D. Trivedi, J. Zhang, S.-H. Sun, and J. J. Lim, "Learning to synthesize programs as interpretable and generalizable policies," NeurIPS, 2021.
- [52] O. Mees and W. Burgard, "Composing pick-and-place tasks by grounding language," in ISER, 2020.
- [53] W. Liu, C. Paxton, T. Hermans, and D. Fox, "StructFormer: Learning spatial structure for language-guided semantic rearrangement of novel objects," in ICRA, 2022.
- [54] W. Yuan, C. Paxton, K. Desingh, and D. Fox, "SORNet: Spatial object-centric representations for sequential manipulation," in CoRL, 2022.
- [55] A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, and R. Bonatti, "Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers," arXiv:2203.13411, 2022.
- [56] A. Bobu, C. Paxton, W. Yang, B. Sundaralingam, Y.-W. Chao, M. Cakmak, and D. Fox, "Learning perceptual concepts by bootstrapping from human queries," RA-L, 2022.
- [57] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano, "Recursively summarizing books with human feedback," arXiv:2109.10862, 2021.
- [58] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, "A systematic evaluation of large language models of code," in MAPS, 2022.
- [59] K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi, "XIRL: Cross-embodiment inverse reinforcement learning," in CoRL, 2022.
- [60] A. Ganapathi, P. Florence, J. Varley, K. Burns, K. Goldberg, and A. Zeng, "Implicit kinematic policies: Unifying joint and Cartesian action spaces in end-to-end robot learning," arXiv:2203.01983, 2022.
discussion (0)