arxiv: 2606.10267 · v1 · pith:4SSYDFXXnew · submitted 2026-06-09 · 💻 cs.RO · cs.AI · cs.LG

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

Pith reviewed 2026-06-27 13:35 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords hierarchical VLA · robot manipulation · vision-language-action · options framework · planner-controller interface · subgoal switching · task decomposition

The pith

Specific choices for planners, controllers, switching rules, and representations make hierarchical VLA agents outperform both flat VLA policies and naive hierarchies on robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper unifies existing hierarchical vision-language-action systems under an options-style framework and runs controlled benchmarks on short-horizon, long-horizon, and reasoning-intensive manipulation tasks. It isolates the effects of planner choice, controller choice, subgoal switching, and observation-memory design. The experiments show that certain combinations of these elements produce higher success rates than either monolithic VLA control or ad-hoc hierarchies. The same patterns hold in simulation and on a physical ALOHA robot arm.

Core claim

By testing representative Hi-VLA agents inside a shared options framework, the study shows that model choices and interface mechanisms jointly determine performance, and that applying the resulting design principles produces a substantially stronger system than flat VLA control or a naively designed hierarchy across the tested task categories.

What carries the argument

An options-style control framework that separates high-level VLM planning from low-level VLA execution and exposes explicit interfaces for subgoal generation, switching, and observation representation.
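
As a rough illustration of the options-style loop described here, the following Python sketch separates a high-level planner from a low-level controller and passes the switching rule in as an explicit argument. Every name in it (`run_options_loop`, `fixed_horizon`, the callable signatures) is a hypothetical stand-in rather than the paper's code, and the fixed-step rule is only one possible switching mechanism.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical interfaces for illustration only; not taken from the paper's code.
Planner = Callable[[str, Any, List[str]], str]   # (task, observation, memory) -> subgoal
Controller = Callable[[str, Any], Any]           # (subgoal, observation) -> next observation
SwitchRule = Callable[[str, Any, int], bool]     # (subgoal, observation, steps) -> hand back?


def fixed_horizon(horizon: int) -> SwitchRule:
    """Return control to the planner after a fixed number of controller steps."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return lambda subgoal, observation, steps: steps >= horizon


def run_options_loop(
    task: str,
    observation: Any,
    planner: Planner,
    controller: Controller,
    should_switch: SwitchRule,
    task_done: Callable[[Any], bool],
    max_subgoals: int = 10,
) -> Tuple[Any, List[str]]:
    """Alternate between planning a language subgoal and executing it."""
    memory: List[str] = []  # subgoals issued so far, shown to the planner
    for _ in range(max_subgoals):
        if task_done(observation):
            break
        subgoal = planner(task, observation, memory)
        steps = 0
        while True:
            observation = controller(subgoal, observation)
            steps += 1
            if should_switch(subgoal, observation, steps):
                break
        memory.append(subgoal)
    return observation, memory


if __name__ == "__main__":
    # Toy stand-ins: the "observation" is a counter and the controller increments it.
    final_obs, issued = run_options_loop(
        task="count to 6",
        observation=0,
        planner=lambda task, obs, memory: f"advance from {obs}",
        controller=lambda subgoal, obs: obs + 1,
        should_switch=fixed_horizon(3),
        task_done=lambda obs: obs >= 6,
    )
    print(final_obs, issued)
```

Swapping `fixed_horizon(3)` for a callable that wraps a success detector would change the switching rule without touching the planner or the controller, which is the kind of interface isolation the framework is meant to expose.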

If this is right

  • Higher task success rates on long-horizon and reasoning-heavy manipulation problems
  • Improved transfer from simulation to physical robot hardware
  • More reliable subgoal decomposition and execution in hierarchical setups
  • A reusable template for constructing future VLA agents

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interface principles could be tested on non-manipulation robot domains such as navigation or mobile manipulation.
  • Larger-scale models might amplify or diminish the measured effects of the identified design choices.
  • The framework could serve as a testbed for comparing future planner or controller architectures without redesigning the overall hierarchy.
  • The performance gap between tuned and naive hierarchies suggests that interface design may matter more than raw model scale in some settings.

Load-bearing premise

That the tested design axes and the chosen short-horizon, long-horizon, and reasoning-intensive tasks capture the main factors that determine success in robot manipulation.

What would settle it

Running the same principle-derived system and the two baselines on a new manipulation task or robot platform and finding no performance advantage for the principle-derived system.

Figures

Figures reproduced from arXiv: 2606.10267 by Annie Xie, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jiaheng Hu, Jie Tan, Mohit Shridhar.

Figure 1: Hierarchical VLA systems have the potential to compensate for the deficiencies of the low-level VLA by generating suitable commands, thereby achieving compositional generalization, especially for long-horizon and reasoning tasks. In this paper, we study the key design choices of Hi-VLA systems, towards a better understanding of how and why they impact the overall performance. and the low-level VLA, the obs…

Figure 2: An overview of our experimental results. In these plots, we visualize how different design choices increase (red) or decrease (blue) the overall performance of hierarchical VLA systems on different types of tasks. We show the detailed results in Appendix J. In the following sections, we present the results and analysis for a comprehensive set of controlled experiments designed to systematically evaluate th…

Figure 3: Change in success rates after adding VLM thinking. Across all task categories and models, VLM inference with thinking enabled consistently outperforms the counterpart (…

Figure 4: VLA Exec. Horizon. Plot labels recovered from the extraction: x-axis "Detection Error Rate", y-axis "Avg. Success Rate (%)", series "False Positive error" and "False Negative error".

Figure 5: Success Detection Error. What is the effect of the VLA execution horizon? An important hyperparameter for Fixed Frequency Termination is the execution horizon, which controls the number of low-level VLA steps before handing control back to the VLM. We conduct additional experiments evaluating this important hyperparameter, and present the results in …

Figure 6: Observation representation pipeline. We study three ways of converting the raw image observation to text: (1) querying a VLM naively, (2) incorporating (VLM-generated) bounding box information to the query, and (3) incorporating privileged contact information to the query. could be due to the phenomenon that VLMs tend to ignore image inputs as task becomes harder [31], which is why passing in additional te…

Figure 7: Motion sequence of the real ALOHA robot. Orchestration allows the robot to recover from …

Figure 8: Performance of different hierarchies with a scripted low-level policy. In this section, we experiment on how potential improvements in the VLA’s action predictions may affect our conclusions. First of all, note that for a “perfect” VLA, hierarchical systems are almost meaningless, since it should be able to directly complete any given task without orchestration. However, we believe that a more realisti…

Figure 9: Example scenes from our study. Each scene is intentionally designed to support multiple …
Original abstract

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a systematic empirical study of hierarchical vision-language-action (Hi-VLA) agents for robot manipulation. It unifies representative Hi-VLA systems under an options-style control framework, benchmarks core design choices (planner/controller selection, switching mechanisms, observation/memory representations) across short-horizon, long-horizon, and reasoning-intensive tasks, distills practical principles from the results, and demonstrates that applying these principles produces a system outperforming both flat VLA baselines and naively designed hierarchies, with validation in simulation and on a physical ALOHA robot.

Significance. If the results hold under rigorous statistical scrutiny, this work offers concrete, actionable design guidelines for an emerging class of hierarchical robot policies that combine VLMs and VLAs. The unification into a common options framework and the inclusion of real-robot experiments are clear strengths that could help move the field beyond ad-hoc Hi-VLA designs toward more reproducible and effective systems.

major comments (2)
  1. [§4] §4 (Experiments) and abstract: Performance gains for the principle-derived system are reported without any mention of the number of trials per condition, statistical tests, error bars, or data exclusion criteria. Given the high variance typical of VLM-based robot policies, this omission makes it impossible to assess whether the claimed 'substantially stronger' performance is statistically reliable or reproducible.
  2. [§3, §5] §3 (Framework) and §5 (Principles): The manuscript benchmarks the four chosen design axes and three task categories but provides no independent argument or ablation showing that these axes are the dominant determinants of Hi-VLA success or that the selected tasks adequately cover representative challenges such as recovery from execution errors. If untested factors (e.g., base model scale or training regime) drive most gains, the distilled principles may not generalize beyond the experimental slice.
minor comments (2)
  1. [Figure 1, §3] Figure 1 and §3: A clearer diagram of the unified options framework, explicitly labeling the interfaces between planner, controller, and switching mechanism, would improve readability.
  2. [§6.2] §6.2 (Real-robot experiments): The description of the ALOHA setup would benefit from explicit mention of the observation rate, action frequency, and any sim-to-real transfer details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical reporting and the scope of our design study. We address each major comment below and will revise the manuscript to strengthen clarity and rigor.

  1. Referee: [§4] §4 (Experiments) and abstract: Performance gains for the principle-derived system are reported without any mention of the number of trials per condition, statistical tests, error bars, or data exclusion criteria. Given the high variance typical of VLM-based robot policies, this omission makes it impossible to assess whether the claimed 'substantially stronger' performance is statistically reliable or reproducible.

    Authors: We agree that explicit reporting of trial counts, variance, and statistical tests is necessary for assessing reliability. The experiments used 10 independent trials per condition in simulation and 5 trials on the physical ALOHA robot; all trials were included with no exclusion criteria. We will revise §4 and the abstract to report these numbers, add error bars to all bar plots, and include paired t-test results comparing the principle-derived system against baselines. This addresses the concern directly without altering the underlying data. revision: yes

  2. Referee: [§3, §5] §3 (Framework) and §5 (Principles): The manuscript benchmarks the four chosen design axes and three task categories but provides no independent argument or ablation showing that these axes are the dominant determinants of Hi-VLA success or that the selected tasks adequately cover representative challenges such as recovery from execution errors. If untested factors (e.g., base model scale or training regime) drive most gains, the distilled principles may not generalize beyond the experimental slice.

    Authors: The four axes (planner/controller selection, switching mechanisms, observation/memory representations) were chosen because they directly correspond to the primary architectural differences among the representative Hi-VLA systems unified under the options framework in §3; our study therefore isolates the impact of these interface choices while holding base models fixed. The three task categories were selected to probe short-horizon execution, long-horizon sequencing, and reasoning demands. We acknowledge that recovery from execution errors and model-scale effects are important and were not exhaustively ablated; we will add a limitations paragraph in §5 explicitly noting these boundaries and the rationale for focusing on controllable design axes rather than claiming exhaustive coverage. revision: partial
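
The first response proposes paired t-tests against the baselines. A minimal sketch of that computation, using only Python's standard library, is given here. It assumes each pair is the success rate of two systems on the same task, which the responses do not specify; the function name `paired_t_statistic` is hypothetical, and the numbers in the usage lines are toy values, not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b.

    Assumption for this sketch: a[i] and b[i] are measurements of two systems
    under the same condition (for example, the same task).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (spread / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Toy per-task success counts out of 10 trials; illustrative only.
    tuned = [6, 7, 8]
    naive = [5, 5, 5]
    t, dof = paired_t_statistic(tuned, naive)
    print(f"t = {t:.3f}, dof = {dof}")
```

The statistic would then be compared against a Student t distribution with the returned degrees of freedom. If the pairs are per-trial binary outcomes rather than per-task rates, a test designed for paired binary data, such as McNemar's test, may be a better fit.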

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or fitted predictions


The paper is a systematic empirical study that unifies Hi-VLA agents under an options framework and benchmarks design choices (planner, controller, switching, observation/memory) across task categories on held-out simulation and real-robot experiments. No mathematical derivations, equations, parameter fittings, or predictions that reduce to inputs by construction are present. Claims rest on measured performance differences rather than self-definitional steps or self-citation chains. This matches the default case of a self-contained empirical paper against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study of existing model families; no free parameters, mathematical axioms, or invented entities are introduced or fitted.

pith-pipeline@v0.9.1-grok · 5774 in / 1144 out tokens · 27113 ms · 2026-06-27T13:35:17.013210+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 20 linked inside Pith

  1. [1]

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  2. [2]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  3. [3]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  4. [4]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  5. [5]

    J. Wen, Y. Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

  6. [6]

    A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, M. Bloesch, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025

  7. [7]

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  8. [8]

    NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. ...

  9. [9]

    S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024

  10. [10]

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  11. [11]

    L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025

  12. [12]

    Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, et al. Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485, 2025

  13. [13]

    A. Figure. Helix: A vision-language-action model for generalist humanoid control. Figure AI News, 2024

  14. [14]

    D. Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011

  15. [15]

    H. Tan, X. Hao, C. Chi, M. Lin, Y. Lyu, M. Cao, D. Liang, Z. Chen, M. Lyu, C. Peng, et al. Roboos: A hierarchical embodied framework for cross-embodiment and multi-agent collaboration. arXiv preprint arXiv:2505.03673, 2025

  16. [16]

    R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999

  17. [17]

    R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999

  18. [18]

    R. Brooks. A robust layered control system for a mobile robot. IEEE journal on robotics and automation, 2(1):14–23, 2003

  19. [19]

    M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  20. [20]

    J. Hu, P. Stone, and R. Martín-Martín. Slac: Simulation-pretrained latent action space for whole-body real-world rl. arXiv preprint arXiv:2506.04147, 2025

  21. [21]

    Z. Su, B. Zhang, N. Rahmanian, Y. Gao, Q. Liao, C. Regan, K. Sreenath, and S. S. Sastry. Hitter: A humanoid table tennis robot via hierarchical planning and learning. arXiv preprint arXiv:2508.21043, 2025

  22. [22]

    M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, J. Wu, S. Sastry, Y. Zhu, K. Goldberg, and L. J. Fan. Cap-x: A framework for benchmarking and improving coding agents for robot manipulation. 2026. URL https://api.semanticscholar.org/CorpusID:286770427

  23. [23]

    J. Shi, R. Yang, K. Chao, S. Wan, Y. S. Shao, J. Lei, J. Qian, L. Le, P. Chaudhari, K. Daniilidis, C. Wen, and D. Jayaraman. Maestro: Orchestrating robotics modules with vision-language models for zero-shot generalist robots. ArXiv, abs/2511.00917, 2025. URL https://api.semanticscholar.org/CorpusID:282738665

  24. [24]

    T. Jiang, T. Yuan, Y. Liu, C. Lu, J. Cui, X. Liu, S. Cheng, J. Gao, H. Xu, and H. Zhao. Galaxea open-world dataset and g0 dual-system vla model, 2025. URL https://arxiv.org/abs/2509.00576

  25. [25]

    P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y. Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025

  26. [26]

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  27. [27]

    P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, V. Choudhary, F. Collins, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, M. Dhaka, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Habeeb, H. Hancock, K. Hausman, G. H...

  28. [28]

    S. Huang, J. Shao, K. Wang, Q. Chen, J. Sun, Y. Guo, M. Schwager, and J. Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training. 2026. URL https://api.semanticscholar.org/CorpusID:287777164

  29. [29]

    W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control. ArXiv, abs/2602.13193, 2026. URL https://api.semanticscholar.org/CorpusID:285606737

  30. [30]

    Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. De Freitas, and S. Cabi. Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023

  31. [31]

    A. Majumdar, A. Ajay, X. Zhang, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. Mcvay, O. Maksymets, S. Arnaud, et al. Openeqa: Embodied question answering in the era of foundation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16488–16498, 2024

  32. [32]

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  33. [33]

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  34. [34]

    J. Hu, J. Shim, C. Tang, Y. Sung, B. Liu, P. Stone, and R. Martin-Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning, 2026. URL https://arxiv.org/abs/2603.11653

  35. [35]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  36. [36]

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  37. [37]

    J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  38. [38]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  39. [39]

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  40. [40]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  41. [41]

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  42. [42]

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, et al. Palm-e: An embodied multimodal language model. arXiv preprint, 2023

  43. [43]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  44. [44]

    Y. Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y. Wang, S. Guo, T. Guan, K. N. Lui, Z. Qi, Y. Liang, Y. Chen, and Y. Yang. A survey on vision-language-action models: An action tokenization perspective, 2025. URL https://arxiv.org/abs/2507.01925

  45. [45]

    R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee. Vision-language-action models: Concepts, progress, applications and challenges, 2025. URL https://arxiv.org/abs/2505.04769

  46. [46]

    Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King. A survey on vision-language-action models for embodied ai, 2025. URL https://arxiv.org/abs/2405.14093

  47. [47]

    P. Guruprasad, H. Sikka, J. Song, Y. Wang, and P. P. Liang. Benchmarking vision, language, and action models on robotic learning tasks, 2024. URL https://arxiv.org/abs/2411.05821

  48. [48]

    J. Gao, S. Belkhale, S. Dasari, A. Balakrishna, D. Shah, and D. Sadigh. A taxonomy for evaluating generalist robot policies. arXiv preprint arXiv:2503.01238, 2025

  49. [49]

    X. Li, P. Li, M. Liu, D. Wang, J. Liu, B. Kang, X. Ma, T. Kong, H. Zhang, and H. Liu. Towards generalist robot policies: What matters in building vision-language-action models, 2024. URL https://arxiv.org/abs/2412.14058

  50. [50]

    C. Gao, Z. Liu, Z. Chi, J. Huang, X. Fei, Y. Hou, Y. Zhang, Y. Lin, Z. Fang, Z. Jiang, and L. Shao. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language-action models, 2025. URL https://arxiv.org/abs/2506.17561

  51. [51]

    Output a single command that should be executed immediately. The command should facilitate completion of the given task

  52. [52]

    The command should be doable within 10 seconds. Consider the affordance of the VLA based on the history steps as well as the current state of the robot

  53. [53]

    Think step by step internally to arrive at the command. Do not output your thought process. Current Memory: [Memory]. VLM Policy Output: [Language Command] F Success Detection Prompt Image Input: [A sequence of Observations] Text Input: You are a success detector for a robot. Your job is to check whether the robot has successfully co...

  54. [54]

    pick up requires that the object is NOT making contact with the table AND is in contact with the gripper

  55. [55]

    put in requires that the object is in contact with the container AND is NOT in contact with the gripper. Has the robot completed the following command "[Language Command]"? Answer with only "yes" or "no" or "uncertain" (in lowercase). VLM Output: [yes / no / uncertain] 3 G Observation Description Prompt Image Input: [Current Ob...
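
The two contact rules quoted in the prompt fragments above ("pick up" and "put in") can be written as simple predicates. The sketch below assumes boolean contact flags are available, for example from a simulator's privileged state; the `Contacts` class and both function names are hypothetical and are not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contacts:
    """Hypothetical contact flags for one object (an assumption of this sketch)."""
    touching_table: bool
    touching_gripper: bool
    touching_container: bool


def pick_up_succeeded(c: Contacts) -> bool:
    # Quoted rule: NOT in contact with the table AND in contact with the gripper.
    return (not c.touching_table) and c.touching_gripper


def put_in_succeeded(c: Contacts) -> bool:
    # Quoted rule: in contact with the container AND NOT in contact with the gripper.
    return c.touching_container and (not c.touching_gripper)


if __name__ == "__main__":
    held = Contacts(touching_table=False, touching_gripper=True, touching_container=False)
    placed = Contacts(touching_table=False, touching_gripper=False, touching_container=True)
    print(pick_up_succeeded(held), put_in_succeeded(held))
    print(pick_up_succeeded(placed), put_in_succeeded(placed))
```

In the quoted prompt these rules are given to a VLM as text and the detector may also answer "uncertain"; the predicates here cover only the deterministic case where contact state is known.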