pith. machine review for the scientific record.

arxiv: 2209.07753 · v4 · submitted 2022-09-16 · 💻 cs.RO

Code as Policies: Language Model Programs for Embodied Control

Pith reviewed 2026-05-15 00:34 UTC · model grok-4.3

classification 💻 cs.RO
keywords language models · robot policies · code generation · few-shot prompting · embodied control · control primitives · hierarchical prompting · spatial reasoning

The pith

Language models write executable robot policies by composing code from a few example commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models trained on code completion can be repurposed to generate robot control policies directly from natural language instructions. By providing a handful of example commands paired with corresponding policy code that calls perception functions and control APIs, the model produces new code for unseen tasks. These policies chain logic, reference libraries for arithmetic and geometry, and resolve vague instructions into precise actions using context. If correct, this shifts robot programming from manual scripting toward conversational specification that works across different robot platforms and tasks.

Core claim

Given a few-shot prompt of example language commands (formatted as comments), each followed by its corresponding policy code, LLMs can take in new commands and autonomously re-compose API calls to generate new policy code that exhibits spatial-geometric reasoning, generalizes to new instructions, and prescribes precise values to ambiguous descriptions depending on context.
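As a rough illustration of the prompt format described above, the sketch below assembles a few-shot prompt in which each command appears as a comment followed by its policy code, and the new command is appended as a trailing comment for the model to complete. The example pairs and the API names `get_obj_pos` and `move_to` are hypothetical placeholders chosen for illustration, not the prompts or APIs used in the paper.

```python
from typing import List, Tuple

# Hypothetical (command, policy code) pairs. The API names get_obj_pos and
# move_to are illustrative placeholders, not taken from the paper.
EXAMPLES: List[Tuple[str, str]] = [
    (
        "move the red block 10 cm to the left",
        "pos = get_obj_pos('red block')\n"
        "move_to('red block', pos + np.array([-0.10, 0.0]))",
    ),
    (
        "put the blue block in the bowl",
        "move_to('blue block', get_obj_pos('bowl'))",
    ),
]


def build_prompt(examples: List[Tuple[str, str]], command: str,
                 header: str = "import numpy as np") -> str:
    """Format examples as '# command.' comments followed by code, then
    append the new command as a comment for the model to complete."""
    parts = [header, ""]
    for cmd, code in examples:
        parts.append(f"# {cmd}.")
        parts.append(code)
        parts.append("")
    parts.append(f"# {command}.")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(EXAMPLES, "put the green block on the yellow bowl"))
```

The resulting string would be sent to a code-completion model, whose continuation is treated as the policy code for the new command.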

What carries the argument

Hierarchical code generation through recursive prompting, where the model defines undefined functions on the fly to build complex policies that process perception outputs and parameterize control primitives.
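One way to picture the recursive step is sketched below: parse the generated code, find functions that are called but not defined, and ask the generator to define each one, repeating on the new definitions. This is a minimal sketch under stated assumptions. The `generate_fn` callback stands in for a language model call, the `FAKE_LLM` table and the function names in it are invented for illustration, and the paper's actual procedure may differ in detail.

```python
import ast
import builtins
from typing import Callable, Dict, Set


def undefined_calls(code: str, known: Set[str]) -> Set[str]:
    """Names called as plain functions in `code` that are neither defined
    in it, nor Python builtins, nor part of the known API."""
    tree = ast.parse(code)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    called = {
        n.func.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    }
    return called - defined - known - set(dir(builtins))


def generate_hierarchically(code: str, generate_fn: Callable[[str], str],
                            known: Set[str], max_depth: int = 3) -> Dict[str, str]:
    """Return {function name: source} for every helper generated on the fly."""
    helpers: Dict[str, str] = {}
    if max_depth == 0:
        return helpers
    for name in sorted(undefined_calls(code, known)):
        if name in helpers:
            continue
        source = generate_fn(name)  # stand-in for a language model call
        helpers[name] = source
        # Recurse: the new definition may itself call undefined functions.
        helpers.update(
            generate_hierarchically(source, generate_fn,
                                    known | set(helpers), max_depth - 1)
        )
    return helpers


# Hypothetical canned "model outputs", used only to exercise the sketch.
FAKE_LLM = {
    "stack_objects": (
        "def stack_objects(names):\n"
        "    for top, bottom in zip(names[1:], names[:-1]):\n"
        "        put_on(top, bottom)\n"
    ),
    "put_on": "def put_on(top, bottom):\n    pick(top)\n    place(bottom)\n",
}

if __name__ == "__main__":
    out = generate_hierarchically("stack_objects(['red', 'blue'])",
                                  FAKE_LLM.__getitem__, known={"pick", "place"})
    print(sorted(out))
```

The depth limit is a design choice of this sketch to keep recursion bounded if the generator keeps introducing new undefined names.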

If this is right

  • Policies gain spatial-geometric reasoning by chaining classic logic and referencing libraries such as NumPy and Shapely.
  • Generated policies generalize to new instructions without additional training or fine-tuning.
  • Vague language like 'faster' is turned into concrete parameter values using behavioral commonsense encoded in the model.
  • Hierarchical code-gen prompting also improves the state of the art on the HumanEval code benchmark, solving 39.8 percent of problems.
  • The formulation supports both reactive policies such as impedance controllers and waypoint-based policies such as pick-and-place.
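To make the first and third points concrete, here is a small sketch of the kind of policy code a model might emit for "left of the bowl" and "faster". Everything in it is an assumption for illustration: the `SCENE` dictionary stands in for perception output, the `Robot` class and its speed limits are invented, and plain tuples are used instead of NumPy or Shapely to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]

# Hypothetical perception output: object name -> (x, y) position in metres.
SCENE: Dict[str, Point] = {"bowl": (0.50, 0.20), "block": (0.10, 0.40)}


@dataclass
class Robot:
    speed: float = 0.10      # m/s, hypothetical default
    max_speed: float = 0.50  # m/s, hypothetical hardware limit


def left_of(obj: str, offset: float = 0.10) -> Point:
    """Spatial reasoning: a target `offset` metres to the left (-x) of `obj`."""
    x, y = SCENE[obj]
    return (x - offset, y)


def go_faster(robot: Robot, factor: float = 1.5) -> float:
    """Turn the vague word 'faster' into a concrete value relative to the
    current speed, capped at the robot's limit."""
    robot.speed = min(robot.speed * factor, robot.max_speed)
    return robot.speed


if __name__ == "__main__":
    robot = Robot()
    print(left_of("bowl"), go_faster(robot))
```

The point of the sketch is that "faster" has no fixed meaning: the value chosen depends on the current state, which is what the summary calls context-dependent parameter assignment.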

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could allow rapid adaptation of robot behavior across different hardware by swapping only the low-level API definitions while keeping the high-level prompt structure fixed.
  • Safety-critical applications would likely require an added runtime monitor layer because the paper's core claim assumes flawless first-try execution.
  • Extending the recursive function definition pattern to multi-robot coordination or long-horizon tasks remains an open direction not tested in the current experiments.
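A runtime monitor of the kind suggested in the second point could look roughly like the sketch below, which clamps commanded velocities and records errors raised by generated code instead of letting them propagate. The `MAX_SPEED` limit and the `set_velocity` primitive are hypothetical, and emptying `__builtins__` in `exec` is not a real security sandbox; it only narrows the names the generated code can see.

```python
from typing import Callable, Dict, List

MAX_SPEED = 0.25  # m/s; hypothetical safety limit


def guarded(set_velocity: Callable[[float], None],
            log: List[str]) -> Callable[[float], None]:
    """Wrap a velocity primitive so out-of-range commands are clamped and logged."""
    def wrapper(v: float) -> None:
        safe = max(-MAX_SPEED, min(MAX_SPEED, v))
        if safe != v:
            log.append(f"clamped {v} -> {safe}")
        set_velocity(safe)
    return wrapper


def run_policy(code: str, api: Dict[str, Callable]) -> List[str]:
    """Execute generated policy code with only `api` exposed.
    Returns a list of error messages (empty if the code ran to completion).
    NOTE: this is not a secure sandbox, only a sketch of error capture."""
    errors: List[str] = []
    try:
        exec(code, {"__builtins__": {}}, dict(api))
    except Exception as exc:
        errors.append(f"{type(exc).__name__}: {exc}")
    return errors


if __name__ == "__main__":
    sent: List[float] = []
    log: List[str] = []
    api = {"set_velocity": guarded(sent.append, log)}
    print(run_policy("set_velocity(2.0)", api), sent, log)
    print(run_policy("grasp()", api))  # undefined API call is reported, not raised
```

A real deployment would need checks beyond velocity limits, such as workspace bounds and collision checks, which this sketch does not attempt.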

Load-bearing premise

The code produced by the language model will execute correctly and safely on physical robots for novel commands without runtime errors or the need for extra verification.

What would settle it

Running the model on a new instruction such as 'move the mug faster toward the target while avoiding the obstacle' and observing that the generated code either crashes, produces unsafe velocities, or fails to complete the motion on the robot.

Original abstract

Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously re-compose API calls to generate new policy code respectively. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) to ambiguous descriptions ("faster") depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers), as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code-gen (recursively defining undefined functions), which can write more complex code and also improves state-of-the-art to solve 39.8% of problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes 'Code as Policies,' a framework in which LLMs trained on code completion are repurposed via few-shot prompting to synthesize executable Python robot policies from natural-language commands. Examples consist of language instructions formatted as comments paired with corresponding policy code that calls perception APIs, control primitives, and third-party libraries (NumPy, Shapely) for geometric reasoning and arithmetic. Hierarchical prompting (recursively defining undefined functions) is introduced to generate more complex policies. The approach is claimed to produce policies exhibiting spatial reasoning, generalization to novel instructions, and context-dependent parameter assignment, with demonstrations on multiple real robot platforms and an improvement to 39.8% on the HumanEval benchmark.

Significance. If the empirical claims are substantiated with quantitative robot-task metrics, the work would be significant for bridging LLMs and robotics by offering an interpretable, code-based mechanism for policy generation that supports generalization and commonsense without task-specific fine-tuning. The hierarchical code-generation technique also contributes to LLM program synthesis.

major comments (2)
  1. [Experimental Evaluation] The central claim that few-shot LLM-generated policies execute correctly and generalize on physical robots for novel commands is load-bearing yet supported only by qualitative success cases and videos. No success rates, trial counts, failure-mode analysis, or ablation studies over a held-out set of novel commands are reported in the experimental evaluation, leaving open the possibility that observed behaviors reflect prompt curation rather than reliable autonomous synthesis.
  2. [Real-Robot Demonstrations] The manuscript asserts that generated policies 'prescribe precise values to ambiguous descriptions' and execute safely on hardware, but provides no runtime verification, error-handling analysis, or discussion of failure modes (e.g., API misuse, unsafe velocities) that would be required to substantiate deployment claims.
minor comments (2)
  1. [Abstract] The abstract states an improvement 'to 39.8%' on HumanEval without clarifying the prior state-of-the-art baseline or the exact prompting setup used for that number.
  2. [Approach] Notation for policy code structure (e.g., how perception outputs are typed and passed to control primitives) is introduced informally; a short pseudocode template or explicit API signature table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to include quantitative metrics and expanded analysis of real-robot execution.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The central claim that few-shot LLM-generated policies execute correctly and generalize on physical robots for novel commands is load-bearing yet supported only by qualitative success cases and videos. No success rates, trial counts, failure-mode analysis, or ablation studies over a held-out set of novel commands are reported in the experimental evaluation, leaving open the possibility that observed behaviors reflect prompt curation rather than reliable autonomous synthesis.

    Authors: We agree that quantitative evaluation is important for substantiating the central claims. In the revised manuscript we have added a dedicated subsection to the experimental evaluation reporting success rates, trial counts, and failure-mode analysis over a held-out set of novel commands. We also include ablation studies comparing prompting variants to address concerns about prompt curation. revision: yes

  2. Referee: [Real-Robot Demonstrations] The manuscript asserts that generated policies 'prescribe precise values to ambiguous descriptions' and execute safely on hardware, but provides no runtime verification, error-handling analysis, or discussion of failure modes (e.g., API misuse, unsafe velocities) that would be required to substantiate deployment claims.

    Authors: We acknowledge that the original manuscript provided limited discussion of these practical aspects. The revision adds an expanded analysis of runtime verification, error-handling mechanisms in the generated policies, and explicit discussion of failure modes including API misuse and unsafe velocities, supported by examples from the robot experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical demonstration of LLM code generation for policies

Full rationale

The manuscript presents an empirical technique for repurposing code-trained LLMs via few-shot prompting to synthesize robot policies. No mathematical derivation chain, equations, or fitted parameters exist that reduce outputs to inputs by construction. Claims rest on curated demonstrations, hierarchical prompting, and an external benchmark result (HumanEval), with no load-bearing self-citations or self-definitional steps. The approach applies known prompting methods to a new domain without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs trained for code completion can reliably produce functional robot-control programs when prompted with examples; no free parameters or new entities are introduced beyond standard prompting hyperparameters.

axioms (1)
  • domain assumption: Large language models trained on code completion can synthesize simple Python programs from docstrings
    Invoked as the foundation for repurposing the models to robot policy code.

pith-pipeline@v0.9.0 · 5619 in / 1142 out tokens · 35913 ms · 2026-05-15T00:34:09.000934+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  2. Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

    cs.RO 2026-05 unverdicted novelty 7.0

    Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.

  3. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  4. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  5. Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

    cs.RO 2026-03 conditional novelty 7.0

    GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.

  6. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  7. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  8. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  9. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  10. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  11. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  12. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  13. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  14. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  15. A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

    cs.RO 2026-04 unverdicted novelty 6.0

    A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...

  16. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  17. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  18. PaLM-E: An Embodied Multimodal Language Model

    cs.LG 2023-03 conditional novelty 6.0

    PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...

  19. ORICF -- Open Robotics Inference and Control Framework

    cs.RO 2026-05 unverdicted novelty 5.0

    ORICF is a declarative, model-agnostic robotics framework with YAML specs and edge offloading that reduces robot compute utilization by up to 83% and energy by 66% in a ROS2 demo combining ASR, LLM, and CNN.

  20. Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.

  21. Environmental Understanding Vision-Language Model for Embodied Agent

    cs.CV 2026-04 unverdicted novelty 5.0

    EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

  22. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  23. Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

    cs.AI 2026-05 unverdicted novelty 3.0

    Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · cited by 22 Pith papers · 14 internal anchors

  1. [1]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv:2107.03374, 2021

  2. [2]

    Mdetr-modulated detection for end-to-end multi-modal understanding,

    A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in ICCV, 2021

  3. [3]

    Open-vocabulary object detection via vision and language knowledge distillation,

    X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” arXiv:2104.13921, 2021

  4. [4]

    Robots that use language,

    S. Tellex, N. Gopalan, H. Kress-Gazit, and C. Matuszek, “Robots that use language,” Review of Control, Robotics, and Autonomous Systems, 2020

  5. [5]

    Procedures as a representation for data in a computer program for understanding natural language,

    T. Winograd, “Procedures as a representation for data in a computer program for understanding natural language,” MIT PROJECT MAC, 1971

  6. [6]

    What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution,

    J. Dzifcak, M. Scheutz, C. Baral, and P. Schermerhorn, “What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution,” in ICRA, 2009

  7. [7]

    Weakly supervised learning of semantic parsers for mapping instructions to actions,

    Y. Artzi and L. Zettlemoyer, “Weakly supervised learning of semantic parsers for mapping instructions to actions,” TACL, 2013

  8. [8]

    Language conditioned imitation learning over unstructured data,

    C. Lynch and P. Sermanet, “Language conditioned imitation learning over unstructured data,” arXiv:2005.07648, 2020

  9. [9]

    Bc-z: Zero-shot task generalization with robotic imitation learning,

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” in CoRL, 2022

  10. [10]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,

    O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” RA-L, 2022

  11. [11]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv:2204.02311, 2022

  12. [12]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, 2020

  13. [13]

    OPT: Open Pre-trained Transformer Language Models

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv:2205.01068, 2022

  14. [14]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

    W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” arXiv:2201.07207, 2022

  15. [15]

    Large Language Models are Zero-Shot Reasoners

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” arXiv:2205.11916, 2022

  16. [16]

    Socratic models: Composing zero-shot multimodal reasoning with language,

    A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke et al., “Socratic models: Composing zero-shot multimodal reasoning with language,” arXiv:2204.00598, 2022

  17. [17]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv:2204.01691, 2022

  18. [18]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in arXiv:2207.05608, 2022

  19. [19]

    Implicit behavioral cloning,

    P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in CoRL, 2022

  20. [20]

    Learning visual affordances for robotic manipulation,

    A. Zeng, “Learning visual affordances for robotic manipulation,” Ph.D. dissertation, Princeton University, 2019

  21. [21]

    Scalable deep reinforcement learning for vision-based robotic manipulation,

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in CoRL, 2018

  22. [22]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv:2203.02155, 2022

  23. [23]

    Compositionality decomposed: How do neural networks generalise?

    D. Hupkes, V. Dankers, M. Mul, and E. Bruni, “Compositionality decomposed: How do neural networks generalise?” JAIR, 2020

  24. [24]

    Social robotics,

    C. Breazeal, K. Dautenhahn, and T. Kanda, “Social robotics,” Springer handbook of robotics, 2016

  25. [25]

    Toward understanding natural language directions,

    T. Kollar, S. Tellex, D. Roy, and N. Roy, “Toward understanding natural language directions,” in HRI, 2010

  26. [26]

    A survey of reinforcement learning informed by natural language,

    J. Luketina, N. Nardelli, G. Farquhar, J. N. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel, “A survey of reinforcement learning informed by natural language,” in IJCAI, 2019

  27. [27]

    Walk the talk: Connecting language, knowledge, and action in route instructions,

    M. MacMahon, B. Stankiewicz, and B. Kuipers, “Walk the talk: Connecting language, knowledge, and action in route instructions,” AAAI, 2006

  28. [28]

    Learning to interpret natural language commands through human-robot dialog,

    J. Thomason, S. Zhang, R. J. Mooney, and P. Stone, “Learning to interpret natural language commands through human-robot dialog,” in IJCAI, 2015

  29. [29]

    Understanding natural language commands for robotic navigation and mobile manipulation,

    S. Tellex, T. Kollar, S. Dickerson, M. Walter, A. Banerjee, S. Teller, and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in AAAI, 2011

  30. [30]

    Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,

    D. Shah, B. Osinski, B. Ichter, and S. Levine, “Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,” arXiv:2207.04429, 2022

  31. [31]

    Learning to parse natural language commands to a robot control system,

    C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox, “Learning to parse natural language commands to a robot control system,” in Experimental robotics, 2013

  32. [32]

    Jointly improving parsing and perception for natural language commands through human-robot dialog,

    J. Thomason, A. Padmakumar, J. Sinapov, N. Walker, Y. Jiang, H. Yedidsion, J. Hart, P. Stone, and R. Mooney, “Jointly improving parsing and perception for natural language commands through human-robot dialog,” JAIR, 2020

  33. [33]

    Learning language-conditioned robot behavior from offline data and crowd-sourced annotation,

    S. Nair, E. Mitchell, K. Chen, S. Savarese, C. Finn et al., “Learning language-conditioned robot behavior from offline data and crowd-sourced annotation,” in CoRL, 2022

  34. [34]

    Learning with Latent Language

    J. Andreas, D. Klein, and S. Levine, “Learning with latent language,” arXiv:1711.00482, 2017

  35. [35]

    Correcting robot plans with natural language feedback,

    P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox, “Correcting robot plans with natural language feedback,” arXiv:2204.05186, 2022

  36. [36]

    Cliport: What and where pathways for robotic manipulation,

    M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, 2021

  37. [37]

    Language-conditioned imitation learning for robot manipulation tasks,

    S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor, “Language-conditioned imitation learning for robot manipulation tasks,” NeurIPS, 2020

  38. [38]

    Language as an abstraction for hierarchical deep reinforcement learning,

    Y. Jiang, S. S. Gu, K. P. Murphy, and C. Finn, “Language as an abstraction for hierarchical deep reinforcement learning,” NeurIPS, 2019

  39. [39]

    Pixl2r: Guiding reinforcement learning using natural language by mapping pixels to rewards,

    P. Goyal, S. Niekum, and R. J. Mooney, “Pixl2r: Guiding reinforcement learning using natural language by mapping pixels to rewards,” arXiv:2007.15543, 2020

  40. [40]

    Self-educated language agent with hindsight experience replay for instruction following,

    G. Cideron, M. Seurin, F. Strub, and O. Pietquin, “Self-educated language agent with hindsight experience replay for instruction following,” DeepMind, 2019

  41. [41]

    Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

    D. Misra, J. Langford, and Y. Artzi, “Mapping instructions and visual observations to actions with reinforcement learning,” arXiv:1704.08795, 2017

  42. [42]

    Grounding language to autonomously-acquired skills via goal generation,

    A. Akakzia, C. Colas, P.-Y. Oudeyer, M. Chetouani, and O. Sigaud, “Grounding language to autonomously-acquired skills via goal generation,” arXiv:2006.07185, 2020

  43. [43]

    A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level,

    I. Drori, S. Zhang, R. Shuttleworth, L. Tang, A. Lu, E. Ke, K. Liu, L. Chen, S. Tran, N. Cheng et al., “A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level,” PNAS, 2022

  44. [44]

    Solving Quantitative Reasoning Problems with Language Models

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al., “Solving quantitative reasoning problems with language models,” arXiv:2206.14858, 2022

  45. [45]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” arXiv:2110.14168, 2021

  46. [46]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and E. Chi, “Least-to-most prompting enables complex reasoning in large language models,” arXiv:2205.10625, 2022

  47. [47]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,” arXiv:2201.11903, 2022

  48. [48]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., “Program synthesis with large language models,” arXiv:2108.07732, 2021

  49. [49]

    Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning

    K. Ellis, C. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum, “Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning,” arXiv:2006.08381, 2020

  50. [50]

    Learning abstract structure for drawing by efficient motor program induction

    L. Tian, K. Ellis, M. Kryven, and J. Tenenbaum, “Learning abstract structure for drawing by efficient motor program induction,” NeurIPS, 2020

  51. [51]

    Learning to synthesize programs as interpretable and generalizable policies

    D. Trivedi, J. Zhang, S.-H. Sun, and J. J. Lim, “Learning to synthesize programs as interpretable and generalizable policies,” NeurIPS, 2021

  52. [52]

    Composing pick-and-place tasks by grounding language

    O. Mees and W. Burgard, “Composing pick-and-place tasks by grounding language,” in ISER, 2020

  53. [53]

    Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects

    W. Liu, C. Paxton, T. Hermans, and D. Fox, “Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects,” in ICRA, 2022

  54. [54]

    Sornet: Spatial object-centric representations for sequential manipulation

    W. Yuan, C. Paxton, K. Desingh, and D. Fox, “Sornet: Spatial object-centric representations for sequential manipulation,” in CoRL, 2022

  55. [55]

    Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers

    A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, and R. Bonatti, “Reshaping robot trajectories using natural language commands: A study of multi-modal data alignment using transformers,” arXiv:2203.13411, 2022

  56. [56]

    Learning perceptual concepts by bootstrapping from human queries

    A. Bobu, C. Paxton, W. Yang, B. Sundaralingam, Y.-W. Chao, M. Cakmak, and D. Fox, “Learning perceptual concepts by bootstrapping from human queries,” RA-L, 2022

  57. [57]

    Recursively summarizing books with human feedback

    J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano, “Recursively summarizing books with human feedback,” arXiv:2109.10862, 2021

  58. [58]

    A systematic evaluation of large language models of code

    F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in MAPS, 2022

  59. [59]

    Xirl: Cross-embodiment inverse reinforcement learning

    K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “Xirl: Cross-embodiment inverse reinforcement learning,” in CoRL. PMLR, 2022

  60. [60]

    Implicit kinematic policies: Unifying joint and cartesian action spaces in end-to-end robot learning

    A. Ganapathi, P. Florence, J. Varley, K. Burns, K. Goldberg, and A. Zeng, “Implicit kinematic policies: Unifying joint and cartesian action spaces in end-to-end robot learning,” arXiv:2203.01983, 2022.

    APPENDIX A. Prompt Engineering: Using LMPs to reliably complete tasks via code generation requires careful prompt engineering. While these prompts do not h...

  61. [61]

    ret_val = 'yellow block' # the blocks

    Language-based reasoning: Full prompt:
    objs = ['green block', 'green bowl', 'yellow block', 'yellow bowl']
    # the yellow block.
    ret_val = 'yellow block'
    # the blocks.
    ret_val = ['green block', 'yellow block']

  62. [62]

    put_first_on_second('gray block', 'gray bowl') objs = ['purple block', 'purple bowl'] # move the purple bowl toward the left

    First-party: Full prompt:
    from utils import get_pos, put_first_on_second
    objs = ['gray block', 'gray bowl']
    # put the gray block on the gray bowl.
    put_first_on_second('gray block', 'gray bowl')
    objs = ['purple block', 'purple bowl']
    # move the purple bowl toward the left.
    target_pos = get_pos('purple bowl') + [-0.3, 0]
    put_first_on_second('purple bowl', t...

  63. [63]

    put_first_on_second('cyan block', 'cyan bowl') objs = ['gray block', 'silver block', 'gray bowl'] # place the top most block on the gray bowl

    Combining language reasoning, third-party, and first-party libraries: Full prompt:
    import numpy as np
    from utils import get_pos, put_first_on_second
    objs = ['cyan block', 'cyan bowl', 'pink bowl']
    # put the cyan block in cyan bowl.
    put_first_on_second('cyan block', 'cyan bowl')
    objs = ['gray block', 'silver block', 'gray bowl']
    # place the top most block...

  64. [64]

    LMPs can be composed: Full prompt:
    import numpy as np
    from utils import get_pos, put_first_on_second, parse_obj
    objs = ['yellow block', 'yellow bowl', 'gray block', 'gray bowl']
    # move the sun colored block toward the left.
    block_name = parse_obj('sun colored block')
    target_pos = get_pos(block_name) + [-0.3, 0]
    put_first_on_second(block_name, target_pos)...

  65. [65]

    find the name of the block closest to the blue bowl,

    parse_obj prompt: Full prompt:
    import numpy as np
    from utils import get_pos
    objs = ['brown bowl', 'green block', 'brown block', 'green bowl']
    # the blocks.
    ret_val = ['brown block', 'green block']
    # the sky colored block.
    ret_val = 'blue block'
    objs = ['orange block', 'cyan block', 'purple bowl', 'gray bowl']
    # the right most block.
    block_names = ['orang...

  66. [66]

    Example Questions: Here are four types of benchmark questions and their examples:
    • Vector operations with NumPy: pts = interpolate_pts_np(start, end, n)
    • Simple controls: u = pd_control(x_curr, x_goal, x_dot, Kp, Kv)
    • Manipulating shapes with shapely: circle = make_circle(radius, center)
    • Using first-party libraries: ret_val = obj_shape_does_not_cont...

  67. [67]

    square" than

    Generalization Analysis: We analyze how well code generation performs across the five types of generalization described in [23], where generalization is evaluated by comparing the examples given in the prompt with the new instructions given in the benchmark. We give a description of the five types of generalization applied to our benchmark. Specifica...

  68. [68]

    draw a 5cm hexagon around the middle

  69. [69]

    draw a line that bisects the hexagon

  70. [70]

    make them both bigger

  71. [71]

    erase the hexagon and the line

  72. [72]

    draw the sun as a circle at the top right

  73. [73]

    draw the ground as a line at the bottom

  74. [74]

    draw a pyramid as a triangle on the ground

  75. [75]

    draw a smaller pyramid a little bit to the left

  76. [76]

    draw circles around the blocks

  77. [77]

    Real-World Tabletop Manipulation: In this domain, a UR5e robot is tasked with manipulating objects on a tabletop according to natural language instructions

    draw a square around the sweeter fruit

    I. Real-World Tabletop Manipulation: In this domain, a UR5e robot is tasked with manipulating objects on a tabletop according to natural language instructions. The robot is equipped with a suction gripper, and it can only perform pick-and-place actions parameterized by 2D top-down pick and place positions. The robot is ...

  78. [78]

    Put the blocks in a horizontal line near the top

  79. [79]

    Move the sky-colored block in between the red block and the second block from the left

  80. [80]

    Why did you move the green block?

Showing first 80 references.