
arxiv: 2606.12475 · v1 · pith:QAKK2QZKnew · submitted 2026-06-10 · 💻 cs.RO

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

Pith reviewed 2026-06-27 09:55 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · human-robot collaboration · imitation learning · action chunking · collaborative manipulation · inference-time steering · implicit collaboration · premature assistance

The pith

Vision-language-action models support implicit human-robot collaboration when inference-time steering corrects premature assistance caused by action chunking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that end-to-end trained vision-language-action models can perform collaborative manipulation tasks between humans and robots without explicit signals or custom pipelines. A central problem is that action-chunking policies leak future demonstration actions across task transitions, prompting the robot to assist too early, such as by offering a tool before the person is ready. The authors introduce an inference-time steering method that adjusts outputs to reduce these errors while retaining the model's core capabilities. This change makes longer action horizons practical, and a study with 16 participants on an assembly task showed faster completion and fewer failures than policies limited to shorter horizons. A reader would care because the result points to scalable learning methods that could handle flexible teamwork in unstructured settings.

Core claim

Models trained end-to-end with imitation learning on vision-language-action data can support collaborative manipulation. Action-chunking policies exhibit a failure mode in which demonstration action leakage across latent task transitions produces premature assistive behavior. This issue grows with longer execution horizons and appears in real-world settings. An inference-time steering method mitigates the erroneous actions without degrading performance on the target distribution. In a 16-participant user study on a long-horizon collaborative assembly task, steering permits longer horizons that produce faster collaboration and fewer failures than a shorter-horizon policy.
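The leakage mechanism can be illustrated with a toy count. This sketch is not taken from the paper. It assumes a single demonstration with one annotated transition index, a prediction horizon `pred_horizon` (the figure text reports Tp = 16), and an execution horizon `exec_horizon` (the study's short condition uses 4). All function names are hypothetical. It counts chunk start times whose observation precedes the transition while at least one post-transition action falls inside the executed prefix.

```python
def leaked_offsets(start, transition, pred_horizon):
    """Offsets in the chunk starting at `start` whose labels lie at or after
    `transition`, given that the observation at `start` is pre-transition."""
    return [k for k in range(pred_horizon) if start < transition <= start + k]


def count_executed_leaks(num_steps, transition, pred_horizon, exec_horizon):
    """Number of chunk start times where a leaked (post-transition) action
    would actually be executed, i.e. lies within the first `exec_horizon` steps."""
    count = 0
    for start in range(num_steps - pred_horizon + 1):
        leaked = leaked_offsets(start, transition, pred_horizon)
        if any(k < exec_horizon for k in leaked):
            count += 1
    return count


# Toy demonstration: 100 steps, latent transition at step 50 (assumed values).
short = count_executed_leaks(100, 50, pred_horizon=16, exec_horizon=4)
full = count_executed_leaks(100, 50, pred_horizon=16, exec_horizon=16)
print(short, full)
```

Under these assumptions the count grows from 3 start times at an execution horizon of 4 to 15 at an execution horizon of 16, consistent with the qualitative claim that the problem worsens as more of each chunk is executed.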

What carries the argument

Inference-time steering applied to action-chunking VLA policies to suppress premature assistive actions arising from demonstration action leakage.
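As a rough illustration of reward-guided selection, the sketch below swaps in plain best-of-N sampling for the paper's steering procedure, which according to the Figure 4 caption acts during denoising. Everything here is an assumption made for illustration: the 1-D state, the two-mode stand-in policy `sample_chunk`, the squared-distance `basin_reward`, and every parameter value. It only shows the idea of rewarding candidate trajectories that stay close to a waiting basin.

```python
import random


def sample_chunk(rng, state, horizon, drift, noise=0.02):
    """Stand-in for one policy sample: a 1-D position chunk with constant drift."""
    chunk, x = [], state
    for _ in range(horizon):
        x += drift + rng.gauss(0.0, noise)
        chunk.append(x)
    return chunk


def basin_reward(chunk, basin, exec_horizon):
    """Higher (less negative) when the executed prefix stays close to the basin."""
    prefix = chunk[:exec_horizon]
    return -sum((x - basin) ** 2 for x in prefix) / len(prefix)


def steer(rng, state, basin, horizon=16, exec_horizon=8, num_candidates=32):
    """Best-of-N selection: draw candidates, keep the one with the highest reward."""
    candidates = []
    for _ in range(num_candidates):
        # Hypothetical bimodal policy: 0.0 keeps waiting, 0.05 reaches prematurely.
        drift = rng.choice([0.0, 0.05])
        candidates.append(sample_chunk(rng, state, horizon, drift))
    return max(candidates, key=lambda c: basin_reward(c, basin, exec_horizon))


rng = random.Random(0)
chosen = steer(rng, state=0.0, basin=0.0)
print(round(abs(chosen[7]), 3))  # displacement from the basin at the end of the executed prefix
```

Scoring only the executed prefix is a design choice of this sketch: actions beyond the execution horizon are re-planned anyway, so they are left unconstrained.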

If this is right

  • Longer execution horizons become usable in collaborative VLA systems without raising rates of premature assistance.
  • End-to-end imitation learning is sufficient for implicit human-robot collaboration in manipulation tasks.
  • Real-world VLA systems can avoid premature tool handovers and similar errors through the steering adjustment.
  • The leakage failure mode is tied directly to chunk length and appears consistently across evaluated models.
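The role of the execution horizon in these points can be made concrete with a minimal receding-horizon loop. This is a generic sketch of chunked execution, not the authors' control stack. `policy`, `observe`, and `act` are hypothetical callables, and the chunk length of 16 mirrors the Tp = 16 reported in the figure text.

```python
def run_receding_horizon(policy, observe, act, total_steps, exec_horizon):
    """Execute `total_steps` actions, re-querying the policy after every
    `exec_horizon` open-loop steps. Returns the number of policy queries."""
    if exec_horizon < 1:
        raise ValueError("exec_horizon must be at least 1")
    queries = 0
    step = 0
    while step < total_steps:
        chunk = policy(observe())  # predict a full chunk from a fresh observation
        queries += 1
        for action in chunk[:exec_horizon]:  # execute only the prefix, open loop
            if step >= total_steps:
                break
            act(action)
            step += 1
    return queries


# Stub components, for illustration only.
executed = []
stub_policy = lambda obs: [0.0] * 16
stub_observe = lambda: None

print(run_receding_horizon(stub_policy, stub_observe, executed.append, 64, 4))
print(run_receding_horizon(stub_policy, stub_observe, executed.append, 64, 16))
```

With these stubs, an execution horizon of 4 needs 16 policy queries for 64 steps, while an execution horizon of 16 needs 4. The longer setting replans less often, and each chunk runs longer without a new observation of the human partner.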

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The steering approach could extend to other sequential prediction settings where training data contains predictable future states that a model might over-anticipate.
  • Online monitoring of human readiness signals might be combined with steering to handle more variable partner behavior.
  • Applying the correction during training rather than only at inference could reduce the need for post-hoc adjustments in new tasks.

Load-bearing premise

That demonstration action leakage across latent transitions is the dominant source of premature assistance and that steering corrects it without introducing new failure modes or reducing performance on the intended tasks.

What would settle it

A user study result in which the steered longer-horizon policy produces more failures or slower task times than the unsteered shorter-horizon baseline would falsify the performance benefit.
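One way such a within-participant comparison could be checked is an exact paired sign-flip permutation test. The sketch below is an assumption about how one might analyze the data, not the paper's analysis, and it is a different test from the t-test or Wilcoxon test mentioned in the rebuttal. The function name is hypothetical and no study data is reproduced here. With 16 participants the exact enumeration covers 2^16 = 65,536 sign patterns, which is cheap to compute.

```python
from itertools import product


def paired_sign_flip_pvalue(cond_a, cond_b):
    """Two-sided exact permutation p-value for paired measurements.

    Under the null hypothesis the sign of each per-participant difference is
    exchangeable, so every sign pattern is enumerated and the fraction with a
    summed difference at least as extreme as the observed one is returned.
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("conditions must have one value per participant")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Placeholder numbers, not measurements from the study.
print(paired_sign_flip_pvalue([2, 4, 6], [1, 2, 3]))
```

For the placeholder differences [1, 2, 3], only the all-positive and all-negative sign patterns reach the observed sum of 6, so the function returns 2/8 = 0.25.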

Figures

Figures reproduced from arXiv: 2606.12475 by Alex Cuellar, Leo Xu, Letian Li, Michael Hagenow.

Figure 1: Collaborative VLA approach. We augment VLAs with collaborative data and inference …
Figure 2: Experimental setup and tasks (showing both the …
Figure 3: Visual depiction of how action chunking from expert demonstrations can lead to …
Figure 4: Steering Approach. Steering denoises with higher reward for closer-to-basin trajectories and pulling further encourages waiting
Figure 5: Waiting data experiments. Increased Ta induces false commitments and drift, corrected through steering. Experiment. To study the effects of steering, we run the sped-up π0.5 for five minutes in a state prior to handover (e.g., holding a screwdriver) with an idle collaborator. We examine false commitments and model drift under different execution horizons with an ablation over the steering interventions. …
Figure 6: User study results. … as our study conditions to see how the different resulting execution horizons impact collaborative performance and perception. Both conditions use the sped-up fine-tuned π0.5 policy with Tp = 16. Condition order was counterbalanced across participants. • Shorter Execution Horizon (4). Our first condition uses a short execution horizon that minimizes false starts and drift by frequently …
Figure 7: Mistakes by subtask: handover vs. manipulation. [Plot "Handover vs. manipulation failures by training-data scale": x-axis training data fraction (20% to 100%), y-axis failures per episode (0.0 to 3.0), series Handover and Manipulation.]
Figure 9: Total mistakes and failures per episode vs. training-data fraction. … training set. As seen in Figures 7 and 8, the majority of mistakes and failures arose from handovers rather than manipulation inaccuracies. Notably, even with only 20 demonstrations, π0.5 makes no manipulation errors. We attribute this to π0.5’s large-scale manipulation pretraining, whereas we hypothesize there is no pretraining on human-r…
Figure 11: Gaussian KDE of per-timestep jerk magnitude (m/s³). D.4 User Evaluation. Here we provide additional detail around the experimental procedure as well as extended study results across objective and subjective measures.
Figure 12: User study time results broken down by total time and times when the robot is moving.
Figure 13: Robot policy mistakes that were recovered by the policy (only introducing inefficiency) …
Figure 14: Manipulation failures broken down by the specific object and interaction. Failures …
Figure 15: Counts of other collaboration-related events in rollouts.
Figure 16: NASA-TLX broken down by subscale.
Figure 17: System Usability Scale (SUS) broken down by item.
Figure 18: Human-Robot Fluency, Trust in Robot, and Positive Teammate Traits, broken down by …
Figure 19: Working Alliance, broken down by question.
Original abstract

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that vision-language-action (VLA) models trained end-to-end via imitation learning can enable implicit human-robot collaboration. It identifies demonstration action leakage in action-chunking policies as causing premature assistance (increasing with longer horizons), proposes an inference-time steering method to mitigate this while preserving performance, and reports that a 16-participant user study on a long-horizon assembly task shows steering yields faster collaboration and fewer failures than a shorter-horizon baseline.

Significance. If validated, the work shows a path to scalable learned HRC without hand-engineered pipelines by diagnosing and correcting a specific failure mode of chunked VLAs. The identification of leakage across latent transitions and the steering intervention, together with the real-world user study, would be a concrete contribution to making imitation-learned policies viable for collaborative manipulation.

major comments (2)
  1. [User study / empirical evaluation] User study (abstract and empirical evaluation): the 16-participant study is presented as showing quantitative gains from steering, yet reports no error bars, statistical tests, details on steering implementation, or data-collection protocol. This leaves the central empirical claim without verifiable support for robustness or effect size.
  2. [Proposed inference-time steering] Steering method description: the claim that steering removes leakage-induced premature assistance without new failure modes or capability loss on the target distribution rests on the unablated user-study outcomes. No controlled isolation (e.g., synthetic trajectories with known transition points or ablation on non-leakage rollouts) is described to establish that leakage is the dominant driver or that steering specifically corrects it.
minor comments (1)
  1. [Abstract / Method] The abstract and method sections should explicitly name the two state-of-the-art VLA models evaluated and the precise metrics (e.g., task completion time, failure types) used to quantify "faster collaboration and fewer failures."

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional details and analysis would strengthen the empirical claims. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [User study / empirical evaluation] User study (abstract and empirical evaluation): the 16-participant study is presented as showing quantitative gains from steering, yet reports no error bars, statistical tests, details on steering implementation, or data-collection protocol. This leaves the central empirical claim without verifiable support for robustness or effect size.

    Authors: We agree that these details are necessary to support the claims. In the revised manuscript we will add error bars to all user-study metrics, report statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), provide the precise implementation details and hyperparameters of the inference-time steering procedure, and expand the experimental protocol section with participant demographics, recruitment, consent, task instructions, and data-collection procedures. These additions will be placed in the empirical evaluation section and will allow readers to assess robustness and effect sizes directly.
    revision: yes

  2. Referee: [Proposed inference-time steering] Steering method description: the claim that steering removes leakage-induced premature assistance without new failure modes or capability loss on the target distribution rests on the unablated user-study outcomes. No controlled isolation (e.g., synthetic trajectories with known transition points or ablation on non-leakage rollouts) is described to establish that leakage is the dominant driver or that steering specifically corrects it.

    Authors: The primary validation in the paper is the real-world 16-participant study because that is the target setting for implicit collaboration. We acknowledge, however, that the manuscript does not include controlled ablations isolating leakage. In revision we will add an analysis subsection (or appendix) that applies the steering method to held-out demonstration trajectories with annotated transition points to quantify its effect on premature actions. If space constraints prevent a full new experiment, we will explicitly note this as a limitation while retaining the user-study results as the main evidence.
    revision: partial
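The offline analysis proposed in the second response could be scored along the following lines. This is a sketch under stated assumptions, not a procedure from the paper. It assumes each held-out rollout is a 1-D position trace paired with an annotated transition index, and that "committing" means leaving a waiting pose by more than a threshold. The function names and the threshold are hypothetical.

```python
def first_commit_step(positions, wait_pose, threshold):
    """Index of the first step that departs from the waiting pose, or None."""
    for t, x in enumerate(positions):
        if abs(x - wait_pose) > threshold:
            return t
    return None


def premature_stats(rollouts, wait_pose=0.0, threshold=0.1):
    """Return (premature rate, mean lead in steps) over (positions, transition) pairs.

    A rollout counts as premature when the policy commits strictly before the
    annotated transition. The lead is how many steps early it committed.
    """
    leads = []
    for positions, transition in rollouts:
        commit = first_commit_step(positions, wait_pose, threshold)
        if commit is not None and commit < transition:
            leads.append(transition - commit)
    rate = len(leads) / len(rollouts)
    mean_lead = sum(leads) / len(leads) if leads else 0.0
    return rate, mean_lead


# Toy traces, for illustration only: one premature, one on time, one never commits.
toy = [
    ([0.0, 0.0, 0.2, 0.3], 3),
    ([0.0, 0.0, 0.0, 0.2], 3),
    ([0.0, 0.0, 0.0, 0.0], 3),
]
print(premature_stats(toy))
```

On the toy traces, one of three rollouts commits one step early, so the premature rate is 1/3 with a mean lead of 1.0 step. Comparing these two numbers with and without steering on the same rollouts would give the kind of controlled isolation the referee asks for.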

Circularity Check

0 steps flagged

No circularity: results from independent training and user study

full rationale

The paper reports empirical outcomes from end-to-end VLA training plus a 16-participant user study comparing steered longer-horizon policies against shorter-horizon baselines. No equations, fitted parameters, or derivations are shown that reduce the claimed performance gains (faster collaboration, fewer failures) to quantities defined or fitted on the same data. Failure-mode identification and steering proposal are presented as observations, not self-referential constructions. No load-bearing self-citations or uniqueness theorems appear in the provided text. This is the common case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work inherits standard imitation-learning and VLA architecture assumptions from prior literature; the abstract introduces no new free parameters, axioms, or invented entities beyond the steering heuristic itself.



A. Model Architecture and Training

For all policies, while demonstrations were recorded at 20 Hz, all training was done with a downsampled 10 Hz feed and images resized to 224×224 pixels. We ...