pith. sign in

arxiv: 2606.30457 · v1 · pith:RN7GQAMBnew · submitted 2026-06-29 · 💻 cs.RO

Behavior Prompting Policy: Demonstrations as Prompts for Manipulation

Pith reviewed 2026-06-30 05:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords behavior promptingrobot manipulationin-context learningvisuomotor policyhuman demonstrationtest-time adaptationmanipulation interface
0
0 comments X

The pith

Behavior prompting lets robots perform new manipulation tasks from one human demonstration without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that robots can adapt to unseen tasks at inference time when given a single behavior prompt in the form of a human demonstration. It introduces the Behavior Prompting Policy, an in-context visuomotor architecture that maps the prompt and current observation to actions. Training data diversity, collected through the iPhUMI handheld interface, is identified as the main driver of this generalization capability. New benchmarks DrawAnything and LIBERO-Gen measure performance on novel drawing and tabletop tasks. The approach is presented as a practical way to specify and execute both known and new robot behaviors via demonstrations alone.

Core claim

The central claim is that an in-context visuomotor policy, trained on a diverse set of manipulation tasks, can translate a behavior prompt demonstration together with the current visual observation into robot actions for tasks that were never seen during training, thereby removing the requirement for task-specific fine-tuning.

What carries the argument

Behavior Prompting Policy (BPP), an in-context visuomotor architecture that conditions actions on a single behavior prompt demonstration.

If this is right

  • Robots can complete novel drawing tasks using DrawAnything without additional training.
  • Robots can handle unseen tabletop manipulation scenarios in LIBERO-Gen from a single prompt.
  • Humans can define new robot capabilities at test time by providing one demonstration through iPhUMI.
  • Task diversity during training is the dominant factor enabling prompting over other architectural choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support iterative teaching loops where a human corrects or extends robot behavior through additional single demonstrations.
  • Scaling the diversity of training tasks may allow the same policy to cover broader classes of manipulation without changes to the architecture.
  • The interface could be adapted to let non-expert users program robots for household or industrial tasks in real time.

Load-bearing premise

Training on tasks collected via iPhUMI supplies enough variety for the in-context model to generalize correctly from one new prompt to an unseen task.

What would settle it

A controlled test in which the trained policy is given a behavior prompt for a held-out task and produces actions that fail to reproduce the demonstrated behavior on that task.

Figures

Figures reproduced from arXiv: 2606.30457 by Austin Patel, Ben Pekarek, Joel Enrique Castro Hernandez, Shuran Song.

Figure 1
Figure 1. Figure 1: Behavior prompting conditions test-time execution on a single human demonstration. This enables a user to specify a task via demonstration (left) or define new robot capabilities (right). 1 Introduction Teaching robots new skills typically requires exhaustive retraining or fine-tuning. In this paper, we propose behavior prompting, a capability that enables robots to perform new tasks at test time given a s… view at source ↗
Figure 2
Figure 2. Figure 2: Behavior Prompting Policy architecture. a) Every ∆t steps of the behavior prompt form a chunk that contains one step of observation and proprio along with ∆t actions. Attention pooling merges {o, q, a} into a chunk embedding pi . The policy consists of: b) a prompt encoder, which extracts relevant prompt information given the current obs, and c) an action decoder, which reasons over the current obs and rel… view at source ↗
Figure 3
Figure 3. Figure 3: Benchmarking Suite. In DrawAnything (a, b), we evaluate whether a policy can recreate a previously unseen drawing at varying board poses given a single human demo. In LIBERO-Gen Combination (c) two identical bowls are randomly positioned, and the robot is given instructions for which one to grasp and where to place it. In LIBERO-Gen Chain (d) we explore the set of two step interactions. We have: 1) first s… view at source ↗
Figure 4
Figure 4. Figure 4: Benchmark Results. For DrawAnything (a,b) we find Goal-Image performs well only on training drawings, while BPP (ours) and ICRT [23] perform well on unseen drawings. Side-by-side qualitative results in (a,b) are for unseen drawings. For unseen manipulation tasks in LIBERO-Gen (c,d,e), BPP outperforms baselines and rivals π0.5 despite not having foundation pretraining. We report the ±1 stdev error bar acros… view at source ↗
Figure 5
Figure 5. Figure 5: Prompt encoder attention (unseen tasks). We visualize the normalized attention scores in the BPP prompt encoder on unseen tasks during rollout (x-axis) to see how the model attends to the prompt (y-axis). For DrawAnything (a,c) the policy attention continuously tracks the portion of the prompt closest to the current observation, while in LIBERO-Gen (b: Combination, d: Chain) the attention tracks discrete “… view at source ↗
Figure 6
Figure 6. Figure 6: BPP ablations on DrawAnything-Sim. We report performance on unseen drawing tasks (3 seeds). [Top row] We ablate what information is included in the prompt and the impact of applying attention pooling to the prompt tokens. [Bottom row] We ablate the composition of our training data. Data Q: What type of training data enables behavior prompting to execute unseen tasks? A1: Task diversity is more important th… view at source ↗
Figure 7
Figure 7. Figure 7: BPP exhibits weak task condi￾tioning under low task diversity. Green is desired action, red is actual execution. While BPP faces challenges in this low task diver￾sity regime, we envision behavior prompting hav￾ing substantial advantages with training data spanning many folding styles across many garments. In partic￾ular, a behavior prompt captures the temporal steps in the folding process as well as the s… view at source ↗
Figure 8
Figure 8. Figure 8: BPP qualitative results on unseen tasks in DrawAnything-Real. We find that BPP is able to reconstruct unseen drawings given a single iPhUMI demo. We show successful drawings (green) and a failure case (red). 14 [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: DrawAnything-Real evaluation tasks with representative examples of robot execu￾tions. Green: executed drawing by policy. Red: goal image input rendered as a reference overlay. We observe that Goal-Image can roughly match the structure of the training drawings, but fails to replicate unseen drawings. BPP is able to reconstruct both training and unseen drawings given a single demonstration. B.3 Qualitative c… view at source ↗
Figure 10
Figure 10. Figure 10: DrawAnything-Sim training tasks. We show a subset of 100 of the 2000 procedurally generated tasks (each having 5 demonstrations per task) using a combination of 1 to 6 parts. Parts include lines, Bezier curves, partial/full ovals, and free space (pen up) movement. The parameters ´ for each part (such as start/end position, Bezier control points, proportion of oval, clockwise/counter- ´ clockwise direction… view at source ↗
Figure 11
Figure 11. Figure 11: DrawAnything-Sim evaluation tasks. We hand-collect 50 evaluation tasks that were not seen during training with 5 demonstrations per task at varying board rotations. Tasks have varied complexity and duration. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Initial environments for LIBERO-Gen Combination. There are two identical bowls in each environment. The task involves moving a specified bowl to a specified target location. From each starting initialization, we generate the combinatorial space of all possible bowl pick locations and place locations to generate a task distribution. We withhold ten tasks (combinations of pick-place locations) to use as eva… view at source ↗
Figure 13
Figure 13. Figure 13: Initial environment for LIBERO-Gen Chain. All two-step chained tasks in this exper￾iment start from the single first step state (blue). We also include one step actions starting from the first step state (blue) as well as second step actions which start assuming one action primitive has already been completed (red). The ablation for no second step excludes the single-step tasks that start from the second … view at source ↗
Figure 14
Figure 14. Figure 14: The iPhUMI handheld data collection gripper. iPhUMI enables real-time localization in new environments, which dramatically reduces the setup time required for collecting demonstra￾tion data compared to the original UMI [1]. It features a custom-built application that facilitates data collection and policy deployment. With iPhUMI, a user can also specify behavior prompts at test-time to immediately conditi… view at source ↗
Figure 15
Figure 15. Figure 15: iPhUMI collected modalities. Data collected with bimanual iPhUMI for a) laundry folding and b) drawing visualized part-way through a user drawing the letter A. For DrawAnything￾Real we do not use the ultrawide or depth camera as policy inputs, but include them for reference. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: iPhUMI data collection interface. We provide an interface to perform gripper calibra￾tion, collect demonstrations, and set appropriate settings for data collection. iPhUMI is capable of collecting five types of data from the iPhone simultaneously: main camera (1920x1440 at 60Hz), ultrawide camera (640x480 at 10 Hz), LiDAR depth (256x192 at 60Hz), gripper pose (60Hz), and gripper width (10Hz) ( [PITH_FULL… view at source ↗
Figure 17
Figure 17. Figure 17: iPhUMI Demonstration management interface. This interface lets the user view and manage collected demonstration data. The data can also be exported to an external SD card con￾nected with a USB C adapter. I.2 iPhUMI policy deployment [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: iPhUMI deployment interface. Both USB and Ethernet streaming are supported to stream main camera, ultrawide camera, and LiDAR depth for use during robot deployment. The deployment interface ( [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
read the original abstract

We study behavior prompting, a paradigm that enables robots to perform new tasks at inference time given a single human demonstration, which we call a behavior prompt. To enable this capability, we present contributions in algorithm, data, and evaluation. For algorithm, we introduce Behavior Prompting Policy (BPP), an in-context visuomotor architecture that translates the behavior prompt and the current observation into robot actions. For data, we identify that task diversity is the primary driver of the prompting capability and introduce iPhUMI, a handheld manipulation interface for collecting diverse training data. For evaluation, we introduce DrawAnything and LIBERO-Gen to evaluate test-time adaptation to unseen drawing and tabletop manipulation tasks. We also demonstrate that iPhUMI serves as a practical interface for specifying behavior prompts at test time, enabling a human to command a robot via a single demonstration to complete known tasks or to define new robot capabilities. Altogether, behavior prompting provides a flexible and scalable way to teach robots new skills without the need for expensive fine-tuning. Our project website is located at https://behavior-prompting.github.io/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Behavior Prompting Policy (BPP), an in-context visuomotor architecture that maps a single human demonstration (behavior prompt) plus current observation to robot actions for test-time adaptation. It identifies task diversity as the primary driver of this capability and introduces iPhUMI, a handheld interface for collecting diverse manipulation data. New benchmarks DrawAnything and LIBERO-Gen are proposed to evaluate generalization to unseen drawing and tabletop tasks, with the claim that this enables flexible skill teaching without fine-tuning.

Significance. If the empirical results establish that iPhUMI task diversity is sufficient for in-context generalization on the new benchmarks, the work would offer a scalable alternative to fine-tuning in robot learning, with potential impact on imitation learning paradigms.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'task diversity is the primary driver of the prompting capability' is central to the contribution yet unsupported by any ablation (e.g., fixed model with varying task count or coverage) or quantitative comparison to data quality or model capacity; without such evidence the 'scalable without fine-tuning' claim does not follow.
  2. [Evaluation] Evaluation section (DrawAnything and LIBERO-Gen): no details on prompt encoding, attention mechanism, or error bars are referenced, making it impossible to verify whether observed generalization on unseen prompts is driven by the stated data diversity rather than other factors.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'known tasks or to define new robot capabilities' is used without clarifying the distinction or providing separate metrics for each case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript introducing Behavior Prompting Policy. We address each major comment below with clarifications and proposed revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'task diversity is the primary driver of the prompting capability' is central to the contribution yet unsupported by any ablation (e.g., fixed model with varying task count or coverage) or quantitative comparison to data quality or model capacity; without such evidence the 'scalable without fine-tuning' claim does not follow.

    Authors: We appreciate this point. Our identification of task diversity as the primary driver is based on comparative experiments in the paper, where training on the diverse iPhUMI dataset enables stronger in-context generalization than less diverse alternatives. However, we agree that dedicated ablations (e.g., varying task count with fixed model and data volume) would provide more direct evidence. In the revised manuscript we will add such an ablation study along with quantitative comparisons to data quality and model capacity to better support the claim. revision: yes

  2. Referee: [Evaluation] Evaluation section (DrawAnything and LIBERO-Gen): no details on prompt encoding, attention mechanism, or error bars are referenced, making it impossible to verify whether observed generalization on unseen prompts is driven by the stated data diversity rather than other factors.

    Authors: Thank you for highlighting this. The BPP architecture, including prompt encoding via the visuomotor backbone and the cross-attention mechanism for conditioning on the behavior prompt, is detailed in Section 3. To address the concern, we will expand the evaluation section to explicitly reference these components when discussing results and will include error bars (standard deviation across seeds) for all metrics on DrawAnything and LIBERO-Gen. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical evaluation of BPP and iPhUMI

full rationale

The paper introduces BPP as an in-context architecture, identifies task diversity via iPhUMI data collection, and evaluates on new benchmarks DrawAnything and LIBERO-Gen. The central claim that diversity enables test-time adaptation without fine-tuning is asserted as an empirical finding from the data and evaluations rather than derived from equations or self-referential definitions. No load-bearing self-citations, fitted inputs renamed as predictions, or reductions of outputs to inputs by construction appear in the provided text. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Central claim rests on the domain assumption that diverse task data enables in-context generalization for visuomotor policies; new method and interface entities are introduced without independent external evidence in the abstract.

axioms (1)
  • domain assumption Task diversity during training is the primary driver of prompting capability for unseen tasks.
    Explicitly identified in the abstract as the key factor enabling behavior prompting.
invented entities (2)
  • Behavior Prompting Policy (BPP) no independent evidence
    purpose: In-context visuomotor architecture translating demonstration prompt and observation to actions.
    New method introduced to realize the prompting paradigm.
  • iPhUMI no independent evidence
    purpose: Handheld interface for collecting diverse manipulation demonstrations.
    New data collection tool proposed in the paper.

pith-pipeline@v0.9.1-grok · 5722 in / 1333 out tokens · 38930 ms · 2026-06-30T05:18:29.419632+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. InProceed- ings of Robotics: Science and Systems (RSS), 2024. URLhttps://arxiv.org/abs/2402. 10329

  2. [2]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  3. [3]

    T. L. Team, J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, N. Kuppuswamy, K.-H. Lee, K. Liu, D. McConachie, I. McMahon, H. Nishimura, C. Phillips-Grafflin, C. Richter, P. Shah, K. Srinivasan, B. Wulfe, C. Xu, M. Zhang, A. Alspach, M. Angeles, K. Arora, V . C. Guizilini, A. Castro, D. C...

  4. [4]

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In5th Annual Conference on Robot Learning, 2021. URLhttps://openreview.net/forum?id=8kbp23tSGYv

  5. [5]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  6. [6]

    Ghosh, H

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  7. [7]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. 9

  8. [8]

    M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  9. [9]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

  10. [10]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410....

  11. [11]

    C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. InConference on Robot Learning, pages 357–368. PMLR, 2017

  12. [12]

    S. Calinon. A tutorial on task-parameterized movement learning and retrieval.Intelligent Service Robotics, 9(1):1–29, 2016. ISSN 1861-2776. doi:10.1007/s11370-015-0187-9

  13. [13]

    Zhang and A

    X. Zhang and A. Boularias. One-shot imitation learning with invariance matching for robotic manipulation.arXiv preprint arXiv:2405.13178, 2024

  14. [14]

    Valassakis, G

    E. Valassakis, G. Papagiannis, N. D. Palo, and E. Johns. Demonstrate once, imitate immedi- ately (DOME): Learning visual servoing for one-shot imitation learning. InIEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS), 2022

  15. [15]

    Y . Duan, M. Andrychowicz, B. Stadie, O. Jonathan Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba. One-shot imitation learning.Advances in Neural Information Processing Systems, 30, 2017

  16. [16]

    Papagiannis, N

    G. Papagiannis, N. D. Palo, P. Vitiello, and E. Johns. R+x: Retrieval and execution from ev- eryday human videos. InIEEE International Conference on Robotics and Automation (ICRA), 2025

  17. [17]

    Jiang, A

    Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan. VIMA: General robot manipulation with multimodal prompts. InFortieth Inter- national Conference on Machine Learning, 2023

  18. [18]

    V . Jain, M. Attarian, N. J. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, I. Gilitschenski, Y . Bisk, and D. Dwibedi. Vid2Robot: End-to-end video-conditioned policy learning with cross-attention transformers. InProceedings of (RSS) Robotics Science and Systems. Proceedings of Robotics: Science and Systems, May 2024

  19. [19]

    In: 2022 In- ternational Conference on Robotics and Automation (ICRA)

    Z. Mandi, F. Liu, K. Lee, and P. Abbeel. Towards more generalizable one-shot visual imitation learning. In2022 International Conference on Robotics and Automation (ICRA), pages 2434– 2444, 2022. doi:10.1109/ICRA46639.2022.9812450. 10

  20. [20]

    M. Xu, Y . Shen, S. Zhang, Y . Lu, D. Zhao, B. J. Tenenbaum, and C. Gan. Prompting decision transformer for few-shot policy generalization. InThirty-ninth International Conference on Machine Learning, 2022

  21. [21]

    R. Shah, S. Liu, Q. Wang, Z. Jiang, S. Kumar, M. Seo, R. Mart ´ın-Mart´ın, and Y . Zhu. Mim- icDroid: In-context learning for humanoid manipulation from human play videos.2026 IEEE International Conference on Robotics and Automation (ICRA), 2026

  22. [22]

    Dreczkowski, P

    K. Dreczkowski, P. Vitiello, V . V osylius, and E. Johns. Learning a thousand tasks in a day. Science Robotics, 10(108):eadv7594, 2025. doi:10.1126/scirobotics.adv7594. URLhttps: //www.science.org/doi/abs/10.1126/scirobotics.adv7594

  23. [23]

    L. Fu, H. Huang, G. Datta, L. Y . Chen, W. C.-H. Panitch, F. Liu, H. Li, and K. Goldberg. In-context imitation learning via next-token prediction.International Conference on Robotics and Automation (ICRA), 2025. URLhttps://arxiv.org/abs/2408.15980

  24. [24]

    Perez, F

    E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  25. [25]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion Policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

  26. [26]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. LIBERO-Plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

  27. [27]

    X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025

  28. [28]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  29. [29]

    G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations, volume 2024, pages 21875–21895, 2024

  30. [30]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.ICLR, 2021

  31. [31]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  32. [32]

    Torne, A

    M. Torne, A. Tang, Y . Liu, and C. Finn. Learning long-context diffusion policies via past-token prediction. In9th Annual Conference on Robot Learning, 2025

  33. [33]

    J. Fang, W. Chen, H. Xue, F. Zhou, T. Le, Y . Wang, Y . Zhang, J. Lv, C. Wen, and C. Lu. RoboPocket: Improve robot policies instantly with your phone.arXiv preprint arXiv:2603.05504, 2026. URLhttps://arxiv.org/abs/2603.05504

  34. [34]

    Etukuru, N

    H. Etukuru, N. Naka, Z. Hu, S. Lee, J. Mehu, A. Edsinger, C. Paxton, S. Chintala, L. Pinto, and N. M. M. Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283. IEEE, 2025. 11

  35. [35]

    X. Xu, J. Park, H. Zhang, E. Cousineau, A. Bhat, J. Barreiros, D. Wang, J. Bohg, and S. Song. HoMMI: Learning whole-body mobile manipulation from human demonstrations. arXiv preprint arXiv:2603.03243, 2026

  36. [36]

    Z. Liu, C. Chi, E. Cousineau, N. Kuppuswamy, B. Burchfiel, and S. Song. Maniwav: Learning robot manipulation from in-the-wild audio-visual data. InConference on Robot Learning, pages 947–962. PMLR, 2025

  37. [38]

    URLhttps://arxiv.org/abs/2604.18933. 12 Appendix A Case Study: Laundry Folding Behavior prompts are a substantially more complex task representation than fixed-length language embeddings, and we want to understand whether this complexity poses challenges when we have lowtask diversity. To study this, we perform a case study with three sweater folding task...