
arxiv: 2606.24884 · v1 · pith:MIQKA47Nnew · submitted 2026-06-23 · 💻 cs.RO · cs.AI · cs.LG

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Pith reviewed 2026-06-26 00:09 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords vision-language-action · skill acquisition · robot manipulation · primitive actions · self-guided learning · continual learning · VLM-guided data generation

The pith

InSight renders vision-language-action models steerable at the primitive-action level to enable autonomous skill acquisition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models learn manipulation skills from demonstrations but remain limited to the skills present in their training data. InSight addresses this bound by introducing steerability at the level of primitive actions such as moving a gripper to an object or pouring from a bottle. The method first applies an automated segmentation pipeline that decomposes demonstrations into labeled primitives using vision-language model plan decomposition together with end-effector poses. It then runs a vision-language model-guided data flywheel that detects missing primitives needed for a new task, generates candidate demonstrations through proposed low-level controls, and automatically incorporates successful ones back into the training set. If the approach holds, primitives learned this way can be composed to solve novel long-horizon tasks without any additional human demonstrations of those target skills.

Core claim

The paper claims that rendering VLAs steerable at the primitive-action level provides a practical foundation for continual skill acquisition. Two components carry the claim: an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses, and a VLM-guided data flywheel that identifies missing primitives, autonomously attempts demonstrations with VLM-proposed low-level controls, and integrates the successful ones. The evidence offered is that tasks such as block flipping, drawer closing, sweeping, twisting, and pouring are learned with no human demonstrations of the target skills.

What carries the argument

Primitive steerability, achieved by automated segmentation of demonstrations into labeled primitives and VLM-guided generation of new demonstrations for missing primitives.

Load-bearing premise

The automated segmentation pipeline reliably partitions demonstrations into accurate, labeled primitives and the VLM can propose low-level controls that produce successful, automatically labelable demonstrations for missing primitives.

What would settle it

Run the full pipeline on a target task such as pouring. The claim would be undermined if segmentation produces inaccurate primitive labels, or if the generated demonstrations consistently fail and are therefore never labeled and stored, so that the VLA never acquires the new primitive.

Figures

Figures reproduced from arXiv: 2606.24884 by Jiajun Wu, Lars Osterberg, Mac Schwager, Maggie Wang, Ola Shorinwa, Stephen Tian.

Figure 1: Overview of INSIGHT. (1) Human demonstrations are automatically segmented into primitive-labeled trajectories to fine-tune a VLA to be steerable via these primitive labels. (2) Given a novel task, a VLM identifies missing primitives, autonomously collects successful rollouts, and retrains the VLA with the new primitives. (3) The newly acquired primitives (e.g., twisting and pouring) can be composed to lear…
Figure 2: INSIGHT overview. (a) Stage 1 builds a steerable VLA from primitive-segmented demonstrations. (b) Stage 2 uses a VLM to identify and acquire missing primitives for novel tasks, adding successful rollouts back into the VLA. A primitive is a reusable action segment that the VLA produces when conditioned on its language label. Following the precondition formalism of task and motion planning (TAMP) [8], each …
Figure 3: Block flip sample efficiency: INSIGHT vs. RL. Full flip success rate as a function of total environment rollouts (task attempts), with the number of rotate-block primitives in grey. The RL SAC [37] baseline (given the same rollout budget) does not complete a flip (0%), although it learns to reach the block (in 23% of episodes) and grasp it (in 10% of episodes), but never lifts and rotates it to completion…
Figure 4: Drawer closing. A VLA is trained only on open-drawer demos (left). Closing the drawer (right) requires a new push drawer closed primitive executed from an open drawer, which is an OOD initial state for the base policy. INSIGHT can use a VLM completion check to terminate the known approach primitive and trigger the new push drawer primitive. The base policy is trained only on drawer-opening demonstrati…
Figure 5: Compositional twist-then-pour evaluation rollout. INSIGHT chains 14 primitives from the separately acquired twist and pour skills, with no end-to-end demonstrations of the combined task. Shaded headers mark primitives acquired autonomously by INSIGHT and added back into the VLA’s vocabulary; unshaded primitives are already known from the pick-and-place base demonstrations. The step/progress value shown in…
Figure 6: Real-world per-primitive success rates, 25 trials per method. Each marker is the success rate of the labeled primitive across rollouts; Overall / End-to-end is full-task success. The π0.5 baseline is fine-tuned on 50 human pick-and-place demos; INSIGHT additionally uses 20 successful acquired primitive episodes. In the cap twisting (left), bottle pouring (center), and twist-then-pour (right) tasks, INSIGHT…
Figure 8: Base skill retention. The unified VLA retains the original pick-and-place skills after adding twist and pour primitives (N=15). …zero-shot at test time without expanding the learned policy, as well as π0.5, a fine-tuned policy with only human demonstrations and no new primitives from INSIGHT. Per-primitive reliability leads to high end-to-end success
Figure 9: Sweeping from only scooping human demonstrations. Exterior and wrist views of the demonstrated scooping skill (top) and the sweeping skill acquired through INSIGHT (bottom). Since both scooping and sweeping require the gripper to be lowered to the rocks, INSIGHT acquires sweeping by adding a lateral-push primitive to the scooping primitives.
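The captions for Figures 5 and 6 connect per-primitive success rates to end-to-end success over a chain of primitives. A rough way to reason about that link is sketched below. It assumes that every primitive in the chain must succeed and that failures are independent; both are simplifying assumptions made here, not claims from the paper, and the function names and example rates are hypothetical.

```python
import math
from typing import Sequence


def chain_success(per_primitive: Sequence[float]) -> float:
    """End-to-end success when all primitives must succeed and failures are independent."""
    if any(not 0.0 <= p <= 1.0 for p in per_primitive):
        raise ValueError("success rates must lie in [0, 1]")
    return math.prod(per_primitive)


def required_rate(target: float, n: int) -> float:
    """Uniform per-primitive rate needed to reach `target` over a chain of n primitives."""
    if n <= 0:
        raise ValueError("n must be positive")
    return target ** (1.0 / n)


# Illustrative numbers only: a 14-step chain, as in the twist-then-pour rollout.
print(chain_success([0.95] * 14))
print(required_rate(0.8, 14))
```

Under these assumptions a 14-step chain at 95% per step completes a little under half the time, which is why long compositions are sensitive to the weakest primitive.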
Original abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents InSight, a framework for autonomous skill acquisition in Vision-Language-Action (VLA) policies. It consists of an automated segmentation pipeline that partitions demonstrations into labeled primitives using VLM plan decomposition and end-effector poses, and a VLM-guided data flywheel that identifies missing primitives for novel tasks, generates demonstrations autonomously with VLM-proposed controls, and integrates successful ones into the training set. The paper claims evaluations on simulation and real-world tasks such as block flipping, drawer closing, sweeping, twisting, and pouring without human demonstrations of the target skills, enabling composition for long-horizon tasks.

Significance. If the central claims hold, the work could offer a practical mechanism for continual, self-guided skill acquisition in VLAs by leveraging primitive-level steerability, potentially reducing the need for extensive human demonstrations in robotic manipulation. This addresses a key limitation in current VLA models where capabilities are bounded by training data.

major comments (3)
  1. [Abstract] Abstract: The abstract states that evaluations were performed across simulation and real-world tasks but supplies no quantitative results, success rates, baselines, or error analysis. This absence makes it impossible to assess whether the VLM-guided flywheel reliably produces successful and automatically labelable demonstrations.
  2. [Abstract] Abstract and flywheel description: The central claim that primitive steerability enables autonomous acquisition without human demonstrations rests on the unverified assumption that the VLM can propose low-level controls yielding trajectories that are both task-successful and correctly segmented by the same pipeline; no attempt-success fractions, failure-mode analysis, or autonomous success-scoring procedure are reported.
  3. [Method (segmentation pipeline)] Segmentation pipeline description: No evidence or analysis is provided on whether the automated segmentation (VLM plan decomposition plus end-effector poses) remains accurate on novel VLM-generated trajectories rather than the original human demonstrations, which is load-bearing for the flywheel to close without external supervision.
minor comments (1)
  1. The project website is referenced but the manuscript does not indicate whether it supplies videos, code, or additional quantitative results that would aid reproducibility assessment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

  1. Referee: [Abstract] Abstract: The abstract states that evaluations were performed across simulation and real-world tasks but supplies no quantitative results, success rates, baselines, or error analysis. This absence makes it impossible to assess whether the VLM-guided flywheel reliably produces successful and automatically labelable demonstrations.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report representative success rates from the simulation and real-world evaluations (e.g., for pouring and sweeping) and note the baseline comparisons performed. The full quantitative results, baselines, and error analysis already appear in the experimental section; the abstract revision will simply surface these numbers at the summary level. revision: yes

  2. Referee: [Abstract] Abstract and flywheel description: The central claim that primitive steerability enables autonomous acquisition without human demonstrations rests on the unverified assumption that the VLM can propose low-level controls yielding trajectories that are both task-successful and correctly segmented by the same pipeline; no attempt-success fractions, failure-mode analysis, or autonomous success-scoring procedure are reported.

    Authors: The manuscript reports overall task success after flywheel integration but does not provide granular attempt-success fractions or a dedicated failure-mode breakdown for the autonomous generation step. We will add a concise subsection (or expanded paragraph) that reports these fractions, describes the observed failure modes, and clarifies the automatic success-scoring procedure used to accept demonstrations into the training set. This addition will directly support the central claim. revision: yes

  3. Referee: [Method (segmentation pipeline)] Segmentation pipeline description: No evidence or analysis is provided on whether the automated segmentation (VLM plan decomposition plus end-effector poses) remains accurate on novel VLM-generated trajectories rather than the original human demonstrations, which is load-bearing for the flywheel to close without external supervision.

    Authors: We acknowledge that the current manuscript validates the segmentation pipeline primarily on the initial human demonstrations and does not include an explicit accuracy comparison on VLM-generated trajectories. This is a substantive gap for the self-supervised claim. We will add targeted analysis (either quantitative metrics or qualitative examples) demonstrating segmentation performance on the autonomously generated trajectories, thereby confirming that the flywheel can operate without external labeling. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper presents an empirical framework relying on external VLM and VLA components whose performance is treated as given inputs rather than quantities derived within the work. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior author work appear in the provided text. The central claim rests on experimental outcomes across tasks rather than any self-referential reduction of outputs to inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the framework rests on the domain assumption that VLMs can accurately decompose plans and propose usable low-level controls; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: VLMs can reliably decompose demonstrations into labeled primitives and propose low-level controls that succeed often enough for the flywheel to improve the VLA.
    Invoked in the description of both the segmentation pipeline and the data flywheel stages.

pith-pipeline@v0.9.1-grok · 5774 in / 1349 out tokens · 26020 ms · 2026-06-26T00:09:00.226104+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 1 canonical work pages

  1. [1]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An Open-Source Vision-Language-Action Model, June

  2. [2]

    URL http://arxiv.org/abs/2406.09246. arXiv:2406.09246 [cs]

  3. [3]

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  4. [4]

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A. Zhan...

  5. [5]

    NASA’s InSight Waits Out Dust Storm - NASA, Oct. 2022. URL https://www.nasa.gov/missions/insight/nasas-insight-waits-out-dust-storm/. Section: InSight (Interior Exploration using Seismic Investigations, Geodesy and Heat Transport)

  6. [6]

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018. URL https://arxiv.org/abs/1806.10293

  7. [7]

    A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning, June

  8. [8]

    URL https://arxiv.org/abs/2506.15799v2

  9. [9]

    Z. Gu, M. Yang, D. Zou, and D. Xu. Learning Diffusion Policy from Primitive Skills for Robot Manipulation, Jan. 2026. URL http://arxiv.org/abs/2601.01948. arXiv:2601.01948 [cs]

  10. [10]

    C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez. Integrated Task and Motion Planning, Oct. 2020. URL http://arxiv.org/abs/2010.01083. arXiv:2010.01083 [cs.RO]

  11. [11]

    W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control, Feb. 2026. URL http://arxiv.org/abs/2602.13193. arXiv:2602.13193 [cs]

  12. [12]

    L. Smith, A. Irpan, M. G. Arenas, S. Kirmani, D. Kalashnikov, D. Shah, and T. Xiao. STEER: Flexible Robotic Manipulation via Dense Language Grounding. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16517–16524, May 2025. doi: 10.1109/ICRA55743.2025.11127404. URL https://ieeexplore.ieee.org/document/11127404/

  13. [13]

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as Policies: Language Model Programs for Embodied Control, May 2023. URL http://arxiv.org/abs/2209.07753. arXiv:2209.07753 [cs]

  14. [14]

    M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Re...

  15. [15]

    URL https://arxiv.org/abs/2204.01691v2

  16. [16]

    W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, Nov. 2023. URL http://arxiv.org/abs/2307.05973. arXiv:2307.05973 [cs.RO]

  17. [17]

    M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, J. Wu, S. Sastry, Y. Zhu, K. Goldberg, and L. J. Fan. CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation, Mar. 2026. URL http://arxiv.org/abs/2603.22435. arXiv:2603.22435 [cs]

  18. [18]

    P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, V. Choudhary, F. Collins, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, M. Dhaka, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Habeeb, H. Hancock, K. Hausman, G. H...

  19. [19]

    S. Liu, I. S. Singh, Y. Xu, J. Duan, and R. Krishna. VLS: Steering Pretrained Robot Policies via Vision-Language Models, Feb. 2026. URL http://arxiv.org/abs/2602.03973. arXiv:2602.03973 [cs]

  20. [20]

    N. B. Gutierrez, J. M. Cloud, and W. J. Beksi. Movement primitives in robotics: A comprehensive survey, 2026. URL https://arxiv.org/abs/2601.02379

  21. [21]

    B. Lee, Y. Lee, S. Kim, M. Son, and F. C. Park. Equivariant Motion Manifold Primitives. In Proceedings of The 7th Conference on Robot Learning, pages 1199–1221. PMLR, Dec. 2023. URL https://proceedings.mlr.press/v229/lee23a.html

  22. [22]

    W. Liu, N. Nie, R. Zhang, J. Mao, and J. Wu. Learning Compositional Behaviors from Demonstration and Language, 2025. URL https://arxiv.org/abs/2505.21981. Version Number: 1

  23. [23]

    Y. Zhu, P. Stone, and Y. Zhu. Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation, Jan. 2022. URL http://arxiv.org/abs/2109.13841. arXiv:2109.13841 [cs]

  24. [24]

    A. Adeniji. Learning Representations for Unsupervised Skill Discovery. 2024. URL https://purl.stanford.edu/sb108vw6601

  25. [25]

    R. Cathomen, M. Mittal, M. Vlastelica, and M. Hutter. Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors. 2025

  26. [26]

    N. Nie, W. Huang, J. Mao, L. Fei-Fei, W. Liu, and J. Wu. Learning composable skills by discovering spatial and temporal structure with foundation models. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  27. [27]

    L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, Feb. 2025. URL https://arxiv.org/abs/2502.19417v2

  28. [28]

    C. Xu, Q. Li, J. Luo, and S. Levine. RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning, Dec. 2024. URL https://arxiv.org/abs/2412.09858v1.

  29. [29]

    J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang. ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, Sept. 2025. URL http://arxiv.org/abs/2505.10911. arXiv:2505.10911 [cs]

  30. [30]

    Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar. Eureka: Human-Level Reward Design via Coding Large Language Models, Apr. 2024. URL http://arxiv.org/abs/2310.12931. arXiv:2310.12931 [cs]

  31. [31]

    X. Zhao, C. Weber, and S. Wermter. Agentic Skill Discovery, Aug. 2024. URL http://arxiv.org/abs/2405.15019. arXiv:2405.15019 [cs]

  32. [32]

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations. 2023

  33. [33]

    J. Duan, W. Yuan, W. Pumacay, Y. R. Wang, K. Ehsani, D. Fox, and R. Krishna. Manipulate-Anything: Automating Real-World Robots using Vision-Language Models, Aug. 2024. URL http://arxiv.org/abs/2406.18915. arXiv:2406.18915 [cs.RO]

  34. [34]

    H. Ha, P. Florence, and S. Song. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, Oct. 2023. URL http://arxiv.org/abs/2307.14535. arXiv:2307.14535 [cs]

  35. [35]

    S. Cheng, Z. Li, K. Yu, and D. Xu. Continual Robot Learning via Language-Guided Skill Acquisition. 2025

  36. [36]

    Y. Wu, G. Wang, Z. Yang, M. Yao, B. Sheil, and H. Wang. Continually Evolving Skill Knowledge in Vision Language Action Model, 2025. URL https://arxiv.org/abs/2511.18085. Version Number: 2

  37. [37]

    X. Wang, Z. Han, Z. Liu, G. Li, J. Dong, B. Liu, L. Liu, and Z. Han. Lifelong Language-Conditioned Robotic Manipulation Learning, Mar. 2026. URL http://arxiv.org/abs/2603.05160. arXiv:2603.05160 [cs.RO]

  38. [38]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  39. [39]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning, Oct. 2023. URL http://arxiv.org/abs/2306.03310. arXiv:2306.03310 [cs]

  40. [40]

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Jan. 2018. URL https://arxiv.org/abs/1801.01290v2

  41. [41]

    S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning, Sept. 2025. URL https://arxiv.org/abs/2509.15937v1

  42. [42]

    S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. J. Fan, and J. Jang. World Action M...

  43. [43]

    B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y. Ze, T. Harada, P. Torr, O. Mees, M. Pollefeys, Z. Liu, J. Wu, P. Abbeel, J. Malik, Y. Du, and J. Yang. World Model for Robot Learning: A Comprehensive Survey, Apr. 2026. URL http://arxiv.org/abs/2605.00080. arXiv:2605.00080 [cs]

  44. [44]

    Gemini Team, Google DeepMind. Gemini 3: Advancing multimodal intelligence, agentic workflows, and deep reasoning. Technical report, Google DeepMind, 2025. URL https://deepmind.google/technologies/gemini.

    A Implementation Details. We use the π0.5 VLA [2] in our experiments, although INSIGHT is agnostic to the underlying VLA. We fine-tune with LoRA [35] (G...

  45. [45]

    Break the goal into fine-grained steps. Use existing primitives for every sub-step they cover -- a skill gap should only be the novel part, not a bundle of existing + novel actions

  46. [46]

    Only create a skill gap when the desired outcome is fundamentally different from what any existing primitive produces. If an existing primitive could achieve the same result (even if executed differently), use it and put execution details in step_notes instead

  47. [47]

    Every step goes in primitive_sequence -- including new ones

  48. [48]

    New primitives also go in skill_gaps (must appear in BOTH lists)

  49. [49]

    Name new primitives by their desired EFFECT, not the robot motion

  50. [50]

    For each step, add a note on execution (approach, grasp, how it enables the next step)

  51. [51]

    After the final step, the runtime returns the gripper to a safe home pose, so the gripper does not need to be cleared from the workspace by a final step in the plan. Each step should make a distinguishable contribution to the goal -- avoid adding a final step whose only effect is repositioning the gripper

  52. [52]

    Each skill gap is one single-axis motion (one translation OR one rotation along one axis, in one direction). If the goal involves multiple distinct motions, create a separate skill gap for each. Example 1 -- pick and place (all existing, no skill gaps): primitive_sequence: ["move gripper to the red lego block", "close gripper", "lift upward", "move grippe...

  53. [53]

    Never select drz for any motion that requires an object to tip over, invert, or pivot its top towards a target; drz only spins the object on its own axis

  54. [54]

    The wrist camera moves with the gripper; its local axes are independent of the global room frame. Never select an axis based on where a target appears to sit (left, right, up, down) in IMAGE 1. Map the required tilt strictly to the local structure of the gripper fingers in IMAGE 2. BE AWARE: Depth and gripper biases may exist due to the close-up wrist vi...

  55. [55]

    KNOWN "move gripper above the yellow bottle cap": Move the gripper into a top-down approach position centered over the yellow cap

  56. [56]

    KNOWN "close gripper": Close the gripper to secure a firm grasp on the cap

  57. [57]

    KNOWN "twist open the cap": Perform a 180-degree counterclockwise rotation to unscrew the cap from the bottle

  58. [58]

    KNOWN "lift upward": Lift the cap vertically to ensure it is completely detached from the bottle threads

  59. [59]

    KNOWN "open gripper": Open the gripper to drop the detached cap onto the workspace

  60. [60]

    KNOWN "return to home": Execute the mandatory hardware reset to return the robot to its canonical home pose

  61. [61]

    KNOWN "move gripper to the side of the yellow bottle body": Move the gripper to a side-approach position relative to the bottle body

  62. [62]

    KNOWN "close gripper": Close the gripper to perform a side grasp on the now-uncapped bottle

  63. [63]

    KNOWN "lift upward": Lift the bottle upward to clear the table for movement

  64. [64]

    KNOWN "move gripper to the side of the bowl": Transport the bottle to the side of the bowl in preparation for pouring

  65. [65]

    KNOWN "tilt bottle forward to pour": Tilt the bottle forward over the bowl to empty its contents

  66. [66]

    KNOWN "tilt bottle back upright": Rotate the bottle back to a vertical, upright orientation

  67. [67]

    KNOWN "lower gripper": Lower the bottle back down to the table surface

  68. [68]

    KNOWN "open gripper": Open the gripper to release the bottle.
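
The planner rules extracted in items 45 to 52 state that every step goes in primitive_sequence and that new primitives must also appear in skill_gaps. A structural check of that rule might look like the sketch below. It is an illustration only: it assumes the plan is a dictionary with `primitive_sequence` and `skill_gaps` as lists of label strings compared by exact match, which the extracted text does not confirm, and `validate_plan` and the example plans are hypothetical.

```python
from typing import Dict, List, Set


def validate_plan(plan: Dict[str, List[str]], known: Set[str]) -> List[str]:
    """Return rule violations for a plan; an empty list means the plan passes."""
    sequence = plan.get("primitive_sequence", [])
    gaps = plan.get("skill_gaps", [])
    errors: List[str] = []
    for gap in gaps:
        # New primitives must appear in BOTH lists.
        if gap not in sequence:
            errors.append(f"skill gap {gap!r} is missing from primitive_sequence")
        # A skill gap should only be the novel part of the plan.
        if gap in known:
            errors.append(f"skill gap {gap!r} duplicates an existing primitive")
    for step in sequence:
        if step not in known and step not in gaps:
            errors.append(f"new primitive {step!r} is missing from skill_gaps")
    return errors


known = {"close gripper", "lift upward", "open gripper"}
good_plan = {
    "primitive_sequence": ["close gripper", "lift upward", "tilt bottle forward to pour"],
    "skill_gaps": ["tilt bottle forward to pour"],
}
bad_plan = {
    "primitive_sequence": ["close gripper", "tilt bottle forward to pour"],
    "skill_gaps": ["close gripper"],
}
print(validate_plan(good_plan, known))
print(validate_plan(bad_plan, known))
```

A check of this kind only verifies list consistency; it says nothing about whether a proposed skill gap is truly a single-axis motion or whether an existing primitive could have covered it, which the prompt leaves to the VLM's judgment.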