Bootstrapping Human-Like Planning via LLMs

David Porfirio; Laura M. Hiatt; Leslie Smith; Morgan Fine-Morris; Vincent Hsiao

arXiv: 2506.22604 · v1 · pith:RLA6UPTTnew · submitted 2025-06-27 · 💻 cs.AI · cs.HC · cs.RO

Pith reviewed 2026-05-21 23:51 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.RO

keywords large language models · robot planning · natural language interfaces · end-user programming · action sequences · human-like planning

The pith

Large language models can produce human-like action sequences for robots from natural language inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pipeline using large language models to convert natural language task descriptions into action sequences at the same level of detail a human would specify for a robot. These outputs are then compared against a collection of action sequences that humans created by hand for the same tasks. Results indicate that bigger models generate sequences more similar to the human ones, although smaller models still produce results that are good enough for practical use. This combination allows for intuitive natural language specification while aiming for the precision of detailed action lists in robot programming.

Core claim

An LLM-based pipeline accepts natural language as input and produces human-like action sequences as output at a granularity matching human specification, and comparison to human hand-specified sequences shows larger models outperform smaller ones while smaller models remain satisfactory.

What carries the argument

LLM pipeline that maps natural language descriptions to detailed, human-granularity action sequences for robot tasks.
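
As a rough illustration of what such a mapping involves, the sketch below turns a task description into a prompt, passes it to a language model, and parses the reply into discrete steps. Everything here is an assumption made for illustration: the action vocabulary, the `verb(arg, ...)` output format, and the names `build_prompt`, `parse_actions`, and `plan` are hypothetical, and the model is a stub. It is not the paper's CAS pipeline, whose internals are not described on this page.

```python
from typing import Callable, List, Set, Tuple

Action = Tuple[str, Tuple[str, ...]]  # (verb, arguments)

# Hypothetical action vocabulary; the paper's actual action set is not given here.
VOCAB: Set[str] = {"moveTo", "grab", "put"}


def build_prompt(task: str, vocab: Set[str]) -> str:
    """Assemble a plain-text prompt that constrains the model to known verbs."""
    verbs = ", ".join(sorted(vocab))
    return (
        f"Task: {task}\n"
        f"Allowed actions: {verbs}\n"
        "List the steps, one per line, as verb(arg1, arg2)."
    )


def parse_actions(text: str, vocab: Set[str]) -> List[Action]:
    """Parse lines such as '2. grab(cup)' and drop anything outside the vocabulary."""
    steps: List[Action] = []
    for line in text.splitlines():
        # Remove leading list numbering such as "1." or "2)".
        line = line.strip().lstrip("0123456789.) ").strip()
        if "(" not in line or not line.endswith(")"):
            continue
        verb, rest = line.split("(", 1)
        verb = verb.strip()
        if verb not in vocab:
            continue  # unknown verbs are discarded rather than guessed at
        args = tuple(a.strip() for a in rest[:-1].split(",") if a.strip())
        steps.append((verb, args))
    return steps


def plan(task: str, llm: Callable[[str], str], vocab: Set[str] = VOCAB) -> List[Action]:
    """Natural language in, list of action steps out."""
    return parse_actions(llm(build_prompt(task, vocab)), vocab)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parser."""
    return "1. moveTo(kitchen)\n2. grab(cup)\n3. dance()\n4. put(cup, table)"


if __name__ == "__main__":
    print(plan("Bring the cup to the table", fake_llm))
```

Filtering against a fixed vocabulary is one simple way to keep generated steps at the granularity of a drag-and-drop interface; a real system would also need to validate arguments against the environment.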

If this is right

Robot end users can specify tasks using natural language rather than only drag-and-drop interfaces.
Smaller language models can be deployed for generating adequate human-like plans without needing the largest available systems.
The approach merges the intuitiveness of language with the meticulous control of step-by-step action specification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such pipelines could enable more people to program robots without specialized training.
Extending this to real robot execution might validate whether the sequences actually complete the tasks successfully.
Similar methods could bootstrap human-like planning in non-robot domains such as software automation or game design.

Load-bearing premise

The hand-specified action sequences collected from humans provide a valid and sufficient gold standard for determining if LLM outputs are human-like.
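
One way to probe this premise, assuming several human-written sequences were available for the same task (this page does not say whether they are), would be to measure how much the humans agree with each other before comparing any model against them. The sketch below averages a pairwise similarity over all annotator pairs. The function name `pairwise_agreement` is hypothetical, and the similarity used is the matching-ratio from Python's standard `difflib`, chosen for convenience rather than taken from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import List, Sequence


def pairwise_agreement(sequences: Sequence[List[str]]) -> float:
    """Mean similarity over all pairs of human-written action sequences.

    Returns a value in [0, 1]; values near 1 mean annotators wrote nearly
    the same plan, values near 0 mean the reference itself is unstable.
    """
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences to measure agreement")
    return mean(
        SequenceMatcher(None, a, b, autojunk=False).ratio() for a, b in pairs
    )


if __name__ == "__main__":
    # Toy annotations for one task; labels are illustrative only.
    annotators = [
        ["moveTo", "grab", "moveTo", "put"],
        ["moveTo", "grab", "moveTo", "put"],
        ["moveTo", "grab", "put"],
    ]
    print(round(pairwise_agreement(annotators), 3))
```

A low agreement score would suggest that distance to any single reference sequence says little about human-likeness in general.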

What would settle it

The claim would be undermined if independent human raters consistently judged the LLM-generated sequences as less natural, less precise, or less effective than the hand-specified human sequences for achieving the same tasks.

Figures

Figures reproduced from arXiv: 2506.22604 by David Porfirio, Laura M. Hiatt, Leslie Smith, Morgan Fine-Morris, Vincent Hsiao.

**Figure 2.** The CAS pipeline for translating NL commands into action steps.

**Figure 3.** Simple data point from the Task Traces dataset [30].

Of the 207 action sequences, we first eliminated 150 that were purely social tasks or contained critical actions that lacked NL descriptions. Determining critical actions involved an author reviewing each action by hand to determine if its inclusion in the sequence is implied by a future action; if not, the action is critical. For example, if an action …

**Figure 4.** Results for Action Similarity (left), Final State Similarity (center), and Length Discrepancy (right). †, *, and ** denote p < 0.1, p < 0.05, and p < 0.01, respectively.

… sequence using the measures discussed in §IV-C. Recall that for each action sequence, there are three NL summaries: two originating from the authors and one originating from an LLM. For each of our measures, we averaged the output from eac…
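
To make two of the measure names above concrete, here is a minimal sketch of how a sequence-level similarity and a length discrepancy could be computed. The paper's own definitions (its §IV-C) are not reproduced on this page, so the formulas below are stand-ins: similarity is taken as one minus a normalized edit distance over action labels, and length discrepancy as the signed difference in step counts. Final State Similarity is omitted because it would require a model of world state.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]


def action_similarity(generated: Sequence[str], reference: Sequence[str]) -> float:
    """Assumed stand-in metric: 1 - normalized edit distance, in [0, 1]."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(generated, reference) / longest


def length_discrepancy(generated: Sequence[str], reference: Sequence[str]) -> int:
    """Signed difference in step count; negative means the model wrote fewer steps."""
    return len(generated) - len(reference)


if __name__ == "__main__":
    human: List[str] = ["moveTo", "grab", "moveTo", "put"]
    model: List[str] = ["moveTo", "grab", "put"]
    print(action_similarity(model, human))   # 0.75
    print(length_discrepancy(model, human))  # -1
```

In the toy example the generated plan skips one movement step, which costs a single deletion out of four reference steps.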

read the original abstract

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows an LLM pipeline can generate robot action sequences matching human granularity from natural language, but the evaluation depends on an unvalidated hand-specified reference set.

The main thing to know is that this paper shows how to use an LLM to convert natural language into robot action sequences that match the detail level humans typically provide, with bigger models doing a better job but smaller ones still performing adequately. They put together a pipeline for this and tested it by comparing the outputs to a set of sequences that humans wrote out by hand for the same tasks. This targets a useful spot in end-user robot programming, where natural language feels easy but can lack precision, and the results suggest a way to bridge that. The work is straightforward and focuses on a real application. It gives some evidence that model scale affects how closely the outputs resemble human planning at that granularity.

The soft spot is the choice of gold standard. The paper uses those hand-specified sequences to judge success, but there's no sign they checked how much agreement there is between different humans on the same task or whether the set represents general human behavior. That makes the claims about 'human-like' and the model size trend less convincing than they could be. Details on the number of tasks tested or the exact metrics used are also missing from what I can see, which makes it tough to assess the strength of the findings.

This paper is for researchers interested in making robot interfaces more accessible through language models. A reader in human-robot interaction or applied LLMs might get some ideas from the pipeline and the comparison approach. I think it should go to peer review. The topic is relevant and the basic setup is sound, but referees would likely ask for more validation on the human data and better reporting of the experiments.

Referee Report

2 major / 1 minor

Summary. The paper constructs an LLM-based pipeline that accepts natural language task descriptions for robots and generates action sequences at a granularity comparable to human-specified ones. It evaluates the outputs via direct comparison to a separate dataset of hand-specified human action sequences, concluding that larger models produce more human-like sequences than smaller models while smaller models still achieve satisfactory performance.

Significance. If the evaluation holds, the work demonstrates a practical way to combine natural language intuitiveness with precise action specification for end-user robot programming. The scaling observation on model size offers a concrete empirical signal about LLM planning capabilities at human-like granularity.

major comments (2)

[§4, Evaluation] The central performance claims rest on similarity to a single hand-specified human reference dataset, yet the manuscript reports neither inter-annotator agreement, multiple independent annotations per task, nor controls for annotator variability. Because both the 'larger models outperform' trend and the 'smaller models achieve satisfactory performance' conclusion are defined solely relative to this reference, the absence of validation that the reference captures typical rather than idiosyncratic human planning is load-bearing.
[Abstract and §4.1] No sample size (number of tasks or sequences), explicit similarity metric, statistical test, or power analysis is supplied for the model-size comparison. This prevents assessment of whether the reported trend is statistically reliable or merely descriptive.
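
For readers unfamiliar with the kind of test the second comment asks for, the sketch below shows one generic option: an exact paired sign-flip permutation test on per-task score differences between two models. This is an illustration only. The function name `paired_sign_flip_p` and the toy scores are hypothetical, and nothing here indicates which test the paper actually used. The exhaustive enumeration is practical only for a small number of tasks, since it visits 2^n sign assignments.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_p(diffs: Sequence[float]) -> float:
    """Two-sided exact p-value for paired differences under sign flipping.

    Under the null hypothesis that the two models are interchangeable,
    each per-task difference is equally likely to be positive or negative.
    The p-value is the share of sign assignments whose absolute total is
    at least as large as the observed one.
    """
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        n_extreme += stat >= observed
        n_total += 1
    return n_extreme / n_total


if __name__ == "__main__":
    # Toy per-task similarity scores for a larger and a smaller model.
    larger = [0.82, 0.75, 0.91, 0.66, 0.78, 0.88]
    smaller = [0.74, 0.70, 0.85, 0.69, 0.71, 0.80]
    diffs = [a - b for a, b in zip(larger, smaller)]
    print(paired_sign_flip_p(diffs))
```

Because the test pairs scores by task, it controls for tasks that are simply harder for every model, which an unpaired comparison of means would not.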

minor comments (1)

[Abstract and §1] The abstract and introduction could more clearly distinguish the proposed pipeline from prior LLM planning work by citing specific granularity differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [§4, Evaluation] The central performance claims rest on similarity to a single hand-specified human reference dataset, yet the manuscript reports neither inter-annotator agreement, multiple independent annotations per task, nor controls for annotator variability. Because both the 'larger models outperform' trend and the 'smaller models achieve satisfactory performance' conclusion are defined solely relative to this reference, the absence of validation that the reference captures typical rather than idiosyncratic human planning is load-bearing.

Authors: We agree that the evaluation depends on a single human reference dataset and that the absence of reported inter-annotator agreement or variability controls is a limitation for claiming the sequences are representative of typical human planning. We will revise §4 to discuss this explicitly, reference any details available from the original dataset source, and note it as a boundary condition on our conclusions. We cannot retroactively obtain new multi-annotator data for the existing reference without additional studies, but the added discussion will clarify the scope of the claims.

revision: partial

Referee: [Abstract and §4.1] No sample size (number of tasks or sequences), explicit similarity metric, statistical test, or power analysis is supplied for the model-size comparison. This prevents assessment of whether the reported trend is statistically reliable or merely descriptive.

Authors: We accept that the current presentation lacks explicit sample size, a clearly stated similarity metric, statistical tests, and power analysis for the model-size results. We will revise both the abstract and §4.1 to supply these details, including the number of tasks and sequences evaluated, the precise metric used to compare generated and human sequences, the outcome of appropriate statistical tests on the size trend, and a power analysis. These changes will make the empirical support for the scaling observation transparent and assessable.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison to external human dataset

The paper constructs an LLM pipeline to generate action sequences from natural language and evaluates them by direct comparison to a separate hand-specified human dataset. No equations, derivations, fitted parameters, or self-citations are described that reduce the central claims (larger models outperform smaller ones; smaller models still satisfactory) to inputs by construction. The evaluation is self-contained against an external reference set with no load-bearing reduction to the paper's own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that human hand-specified sequences are the appropriate reference for human-like planning and that surface-level sequence comparison captures the intended notion of human-likeness.

axioms (1)

domain assumption: Human hand-specified action sequences constitute the target distribution for 'human-like' robot planning.
Invoked when the paper states it compares generated sequences to a dataset of hand-specified sequences.

pith-pipeline@v0.9.0 · 5680 in / 1143 out tokens · 33466 ms · 2026-05-21T23:51:29.866529+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output... compare these generated action sequences to another dataset of hand-specified action sequences."

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "larger models tend to outperform smaller ones in the production of human-like action sequences"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] J. Huang and M. Cakmak, “Code3: A system for end-to-end programming of mobile manipulator robots for novices and experts,” in 12th ACM/IEEE Int. Conf. on Human-Robot Interact., 2017
[2] G. Huang, P. S. Rao, M.-H. Wu, X. Qian, S. Y. Nof, K. Ramani, and A. J. Quinn, “Vipo: Spatial-visual programming with functions for robot-iot workflows,” in Proc. CHI Conf. on Human Factors in Comput. Syst., 2020
[3] D. Porfirio, M. Roberts, and L. M. Hiatt, “Goal-oriented end-user programming of robots,” in 19th ACM/IEEE Int. Conf. on Human-Robot Interact., 2024
[4] S. Rosenthal and L. M. Hiatt, “Human-centered decision support for agenda scheduling,” in Proc. 19th Int. Conf. on Autonomous Agents and MultiAgent Syst., 2020
[5] P. J. Modi, M. Veloso, S. F. Smith, and J. Oh, “CMRadar: A personal assistant agent for calendar management,” in Int. Bi-Conf. Workshop on Agent-Oriented Information Syst. Springer, 2004
[6] J. A. Auld, Agent-based dynamic activity planning and travel scheduling model: Data collection and model development. University of Illinois at Chicago, 2011
[7] T. Miller, P. Howe, and L. Sonenberg, “Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences,” in IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 2017
[8] T. Chakraborti, S. Sreedharan, S. Grover, and S. Kambhampati, “Plan explanations as model reconciliation – an empirical study,” in 14th ACM/IEEE Int. Conf. on Human-Robot Interact., 2019
[9] C. H. Song, B. M. Sadler, J. Wu, W.-L. Chao, C. Washington, and Y. Su, “LLM-Planner: Few-shot grounded planning for embodied agents with large language models,” in Proc. IEEE/CVF Int. Conf. on Computer Vision, 2023
[10] T. Silver, S. Dan, K. Srinivas, J. B. Tenenbaum, L. Kaelbling, and M. Katz, “Generalized planning in pddl domains with pretrained large language models,” Proc. AAAI Conf. on Artif. Intell., 2024
[11] S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. P. Saldyt, and A. B Murthy, “Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks,” in Proc. 41st Int. Conf. on Machine Learning, 2024
[12] M. J.-Y. Chung, J. Huang, L. Takayama, T. Lau, and M. Cakmak, “Iterative design of a system for programming socially interactive service robots,” in Social Robotics, 2016
[13] D. Glas, S. Satake, T. Kanda, and N. Hagita, “An interaction design framework for social robots,” in Proc. Robot.: Sci. and Syst., 2011
[14] E. Pot, J. Monceaux, R. Gelin, and B. Maisonnier, “Choregraphe: a graphical tool for humanoid robot programming,” in 18th IEEE Int. Symp. on Robot and Human Interactive Commun., 2009
[15] N. Leonardi, M. Manca, F. Paternò, and C. Santoro, “Trigger-action programming for personalising humanoid robot behaviour,” in Proc. 2019 CHI Conf. on Human Factors in Comput. Syst., 2019
[16] Y. Cao, Z. Xu, F. Li, W. Zhong, K. Huo, and K. Ramani, “V.Ra: An in-situ visual authoring system for robot-IoT task planning with augmented reality,” in Proc. Designing Interactive Syst. Conf., 2019
[17] E. Senft, M. Hagenow, R. Radwin, M. Zinn, M. Gleicher, and B. Mutlu, “Situated live programming for human-robot collaboration,” in ACM Symp. User Interface Softw. Technol., 2021
[18] B. Ikeda, M. Gramopadhye, L. Nekervis, and D. Szafir, “Marcer: Multimodal augmented reality for composing and executing robot tasks,” in 20th ACM/IEEE Int. Conf. on Human-Robot Interact., 2025
[19] Y. Ge, Y. Dai, R. Shan, K. Li, Y. Hu, and X. Sun, “Cocobo: Exploring large language models as the engine for end-user robot programming,” in IEEE Symp. Vis. Lang. Human-Centric Comput., 2024
[20] V. Schlegel, B. Lang, S. Handschuh, and A. Freitas, “Vajra: step-by-step programming with natural language,” in Proc. 24th Int. Conf. on Intelligent User Interfaces, 2019
[21] J. F. Gorostiza and M. A. Salichs, “End-user programming of a social robot by dialog,” Robot. Auton. Syst., vol. 59, 2011
[22] S. Beschi, D. Fogli, and F. Tampalini, “Capirci: A multi-modal system for collaborative robot programming,” in End-User Develop., 2019
[23] N. G. Buchina, P. Sterkenburg, T. Lourens, and E. I. Barakova, “Natural language interface for programming sensory-enabled scenarios for human-robot interaction,” in 28th IEEE International Conf. on Robot and Human Interactive Commun., 2019
[24] U. B. Karli, J.-T. Chen, V. N. Antony, and C.-M. Huang, “Alchemist: LLM-aided end-user development of robot applications,” in 19th ACM/IEEE Int. Conf. on Human-Robot Interact., 2024
[25] K. Mahadevan, B. Lewis, J. Li, B. Mutlu, A. Tang, and T. Grossman, “Imageinthat: Manipulating images to convey user instructions to robots,” in 20th ACM/IEEE Int. Conf. on Human-Robot Interact., 2025
[26] K. Valmeekam, K. Stechly, A. Gundawar, and S. Kambhampati, “Planning in strawberry fields: Evaluating and improving the planning and scheduling capabilities of lrm o1,” arXiv:2410.02162, 2024
[27] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv:2204.01691, 2022
[28] M. M. de Graaf and B. F. Malle, “How people explain action (and autonomous intelligent systems should too),” in AAAI Fall Symp. on Artificial Intelligence for Human-Robot Interaction, 2017
[29] X. Puig et al., “Virtualhome: Simulating household activities via programs,” in 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2018
[30] D. Porfirio, A. Sauppé, M. Cakmak, A. Albarghouthi, and B. Mutlu, “Crowdsourcing task traces for service robotics,” in ACM/IEEE Int. Conf. on Human-Robot Interact., 2023
[31] M. Shridhar et al., “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” in IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2020
[32] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” in Advances in Neural Information Processing Syst., 2023
[33] M. Fox, A. Gerevini, D. Long, and I. Serina, “Plan stability: Replanning versus plan repair,” in Proc. Int. Conf. on Automated Planning and Scheduling, 2006
[34] D. Porfirio, M. Roberts, and L. M. Hiatt, “An interaction specification language for robot application development,” in 20th ACM/IEEE Int. Conf. on Human-Robot Interact., 2025
