InSight: Self-Guided Skill Acquisition via Steerable VLAs

Jiajun Wu; Lars Osterberg; Mac Schwager; Maggie Wang; Ola Shorinwa; Stephen Tian

arxiv: 2606.24884 · v1 · pith:MIQKA47Nnew · submitted 2026-06-23 · 💻 cs.RO · cs.AI · cs.LG

Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager

Pith reviewed 2026-06-26 00:09 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords vision-language-action · skill acquisition · robot manipulation · primitive actions · self-guided learning · continual learning · VLM-guided data generation

The pith

InSight renders vision-language-action models steerable at the primitive-action level to enable autonomous skill acquisition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models learn manipulation skills from demonstrations but remain limited to the skills present in their training data. InSight addresses this bound by introducing steerability at the level of primitive actions such as moving a gripper to an object or pouring from a bottle. The method first applies an automated segmentation pipeline that decomposes demonstrations into labeled primitives using vision-language model plan decomposition together with end-effector poses. It then runs a vision-language model-guided data flywheel that detects missing primitives needed for a new task, generates candidate demonstrations through proposed low-level controls, and automatically incorporates successful ones back into the training set. If the approach holds, primitives learned this way can be composed to solve novel long-horizon tasks without any additional human demonstrations of those target skills.
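The flywheel described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Flywheel` class, the `attempt` callback (standing in for one rollout with VLM-proposed low-level control plus an automatic success check), and the `target_successes` and `max_attempts` parameters are all hypothetical names introduced here. Adding a label to `known` stands in for retraining the VLA on the enlarged dataset.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Set


@dataclass
class Episode:
    primitive: str  # language label of the attempted primitive
    success: bool   # outcome of the (hypothetical) automatic success check


@dataclass
class Flywheel:
    known: Set[str]                    # primitive labels the VLA can already execute
    attempt: Callable[[str], Episode]  # hypothetical: one autonomous rollout attempt
    dataset: List[Episode] = field(default_factory=list)

    def missing(self, plan: Sequence[str]) -> List[str]:
        """Primitives the plan needs that are not yet known, in plan order."""
        seen, gaps = set(), []
        for p in plan:
            if p not in self.known and p not in seen:
                seen.add(p)
                gaps.append(p)
        return gaps

    def acquire(self, plan: Sequence[str], target_successes: int = 2,
                max_attempts: int = 10) -> List[str]:
        """Attempt each missing primitive; keep only successful episodes."""
        acquired = []
        for prim in self.missing(plan):
            wins = 0
            for _ in range(max_attempts):
                ep = self.attempt(prim)
                if ep.success:
                    self.dataset.append(ep)  # failed attempts are discarded
                    wins += 1
                if wins >= target_successes:
                    self.known.add(prim)     # stands in for retraining the VLA
                    acquired.append(prim)
                    break
        return acquired


def scripted_attempt(outcomes):
    """Test double: replays a fixed list of success/failure outcomes."""
    it = iter(outcomes)
    return lambda prim: Episode(prim, next(it))


fw = Flywheel(known={"close gripper", "lift upward"},
              attempt=scripted_attempt([False, True, True]))
plan = ["close gripper", "lift upward", "rotate block", "lift upward"]
acquired = fw.acquire(plan)
```

In this toy run one failed attempt is thrown away and two successful ones are stored before `rotate block` counts as acquired. Whether a real system reaches that threshold within its attempt budget is the empirical question the page returns to below.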

Core claim

The paper claims that making VLAs steerable at the primitive-action level provides a practical foundation for continual skill acquisition. Two components carry this: an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses, and a VLM-guided data flywheel that identifies missing primitives, autonomously attempts demonstrations with VLM-proposed low-level controls, and integrates the successful ones. The evidence offered is that tasks such as block flipping, drawer closing, sweeping, twisting, and pouring are learned with no human demonstrations of the target skills.

What carries the argument

Primitive steerability, achieved by automated segmentation of demonstrations into labeled primitives and VLM-guided generation of new demonstrations for missing primitives.

Load-bearing premise

The automated segmentation pipeline reliably partitions demonstrations into accurate, labeled primitives and the VLM can propose low-level controls that produce successful, automatically labelable demonstrations for missing primitives.

What would settle it

Running the full pipeline on a target task such as pouring and finding either that segmentation produces inaccurate primitive labels or that the generated demonstrations consistently fail (and so are never labeled and added to the training set), so that the VLA never acquires the new primitive.

Figures

Figures reproduced from arXiv: 2606.24884 by Jiajun Wu, Lars Osterberg, Mac Schwager, Maggie Wang, Ola Shorinwa, Stephen Tian.

**Figure 1.** Overview of INSIGHT. (1) Human demonstrations are automatically segmented into primitive-labeled trajectories to fine-tune a VLA to be steerable via these primitive labels. (2) Given a novel task, a VLM identifies missing primitives, autonomously collects successful rollouts, and retrains the VLA with the new primitives. (3) The newly acquired primitives (e.g., twisting and pouring) can be composed to lear…

**Figure 2.** INSIGHT overview. (a) Stage 1 builds a steerable VLA from primitive-segmented demonstrations. (b) Stage 2 uses a VLM to identify and acquire missing primitives for novel tasks, adding successful rollouts back into the VLA. A primitive is a reusable action segment that the VLA produces when conditioned on its language label. Following the precondition formalism of task and motion planning (TAMP) [8], each …

**Figure 3.** Block flip sample efficiency: INSIGHT vs. RL. Full flip success rate as a function of total environment rollouts (task attempts), with the number of rotate-block primitives in grey. The RL SAC [37] baseline (given the same rollout budget) does not complete a flip (0%), although it learns to reach the block (in 23% of episodes) and grasp it (in 10% of episodes), but never lifts and rotates it to completion…

**Figure 4.** Drawer closing. A VLA is trained only on open-drawer demos (left). Closing the drawer (right) requires a new push drawer closed primitive executed from an open drawer, which is an OOD initial state for the base policy. INSIGHT can use a VLM completion check to terminate the known approach primitive and trigger the new push drawer primitive. The base policy is trained only on drawer-opening demonstrati…

**Figure 5.** Compositional twist-then-pour evaluation rollout. INSIGHT chains 14 primitives from the separately acquired twist and pour skills, with no end-to-end demonstrations of the combined task. Shaded headers mark primitives acquired autonomously by INSIGHT and added back into the VLA’s vocabulary; unshaded primitives are already known from the pick-and-place base demonstrations. The step/progress value shown in…

**Figure 6.** Real-world per-primitive success rates, 25 trials per method. Each marker is the success rate of the labeled primitive across rollouts; Overall / End-to-end is full-task success. The π0.5 baseline is fine-tuned on 50 human pick-and-place demos; INSIGHT additionally uses 20 successful acquired primitive episodes. In the cap twisting (left), bottle pouring (center), and twist-then-pour (right) tasks, INSIGHT…

**Figure 8.** Base skill retention. The unified VLA retains the original pick-and-place skills after adding twist and pour primitives (N=15). … zero-shot at test time without expanding the learned policy, as well as π0.5, a fine-tuned policy with only human demonstrations and no new primitives from INSIGHT. Per-primitive reliability leads to high end-to-end success …

**Figure 9.** Sweeping from only scooping human demonstrations. Exterior and wrist views of the demonstrated scooping skill (top) and the sweeping skill acquired through INSIGHT (bottom). Since both scooping and sweeping require the gripper to be lowered to the rocks, INSIGHT acquires sweeping by adding a lateral-push primitive to the scooping primitives.

Original abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

InSight sketches a clean flywheel for VLA primitive acquisition, but the abstract supplies zero numbers on whether the VLM proposals actually succeed or are labeled correctly.

The main takeaway is that this paper tries to close the data loop for VLAs by first carving demonstrations into labeled primitives and then letting a VLM spot missing ones, propose low-level controls, and feed successful runs back into training. The claim is that once those primitives exist, they compose into new long-horizon tasks without extra human data.

The segmentation stage looks workable: it combines VLM plan decomposition with end-effector poses to produce steerable units such as gripper moves or pours. That gives the policy a finer-grained interface than whole-trajectory conditioning. The flywheel then uses a VLM to identify gaps for a target task and attempts to generate the missing primitives autonomously. The listed tasks (flipping, drawer closing, sweeping, twisting, pouring) are common manipulation tasks, and evaluating in both simulation and the real world is sensible.
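To make the segmentation idea concrete, here is a minimal Python sketch of one way to pair an ordered list of plan labels with spans of a trajectory. It is not the paper's pipeline. The boundary heuristic is an assumption made for illustration: cut at gripper open/close transitions and at pauses in end-effector motion. The function name `segment_by_events` and the `speed_eps` threshold are hypothetical. The labels would come from a VLM plan decomposition; here they are hard-coded.

```python
import math
from typing import List, Sequence, Tuple


def segment_by_events(
    positions: Sequence[Sequence[float]],
    gripper_open: Sequence[bool],
    labels: Sequence[str],
    speed_eps: float = 1e-3,
) -> List[Tuple[str, int, int]]:
    """Split a trajectory into labeled step spans [start, end).

    Step t is the transition from frame t to frame t + 1. Each step is tagged
    as a gripper event, a motion, or a pause; runs of equal tags become
    segments, and pauses are dropped.
    """
    modes = []
    for t in range(len(positions) - 1):
        if gripper_open[t] != gripper_open[t + 1]:
            modes.append("grip")
        elif math.dist(positions[t], positions[t + 1]) > speed_eps:
            modes.append("move")
        else:
            modes.append("pause")

    spans = []
    start = 0
    for t in range(1, len(modes) + 1):
        # Close the current run at the end of the list or when the tag changes.
        if t == len(modes) or modes[t] != modes[start]:
            if modes[start] != "pause":
                spans.append((start, t))
            start = t

    if len(spans) != len(labels):
        raise ValueError(f"found {len(spans)} segments for {len(labels)} plan steps")
    return [(label, s, e) for label, (s, e) in zip(labels, spans)]


# Toy trajectory: slide along x, close the gripper in place, then lift along z.
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, 0), (2, 0, 1), (2, 0, 2)]
gripper_open = [True, True, True, False, False, False]
plan = ["move gripper to the red lego block", "close gripper", "lift upward"]
segments = segment_by_events(positions, gripper_open, plan)
```

A heuristic like this cannot split two motions that follow each other without a pause, and it raises an error whenever the segment count and the plan length disagree. Those are the kinds of failures the segmentation premise would need to rule out on autonomously generated trajectories.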

What the work does well is name a real limitation—VLAs are capped by their initial demonstration set—and lay out an explicit mechanism to grow that set without constant human labeling. The architecture is modular and reuses existing VLM and VLA components rather than inventing new models from scratch.

The soft spot is exactly where the stress-test note points: the abstract states the flywheel runs “autonomously” and “without any human demonstrations” yet reports no success rates, no attempt counts, no failure modes, and no check on whether the automatic labeler stays accurate on VLM-generated trajectories. If the VLM proposals fail often or the segmentation mislabels novel motions, the loop cannot close and the steerability benefit disappears. Treating the external VLM as a reliable oracle is a strong assumption that needs evidence.

This is for people already working on vision-language-action models who want concrete ideas for continual acquisition. A reader focused on reducing human data collection would find the structure worth looking at.

It should go to peer review. The problem is well-posed and the proposed pieces are coherent; the authors simply need to supply the missing quantitative results and an ablation of the flywheel’s reliability before the central claim can be assessed.

Referee Report

3 major / 1 minor

Summary. The manuscript presents InSight, a framework for autonomous skill acquisition in Vision-Language-Action (VLA) policies. It consists of an automated segmentation pipeline that partitions demonstrations into labeled primitives using VLM plan decomposition and end-effector poses, and a VLM-guided data flywheel that identifies missing primitives for novel tasks, generates demonstrations autonomously with VLM-proposed controls, and integrates successful ones into the training set. The paper claims evaluations on simulation and real-world tasks such as block flipping, drawer closing, sweeping, twisting, and pouring without human demonstrations of the target skills, enabling composition for long-horizon tasks.

Significance. If the central claims hold, the work could offer a practical mechanism for continual, self-guided skill acquisition in VLAs by leveraging primitive-level steerability, potentially reducing the need for extensive human demonstrations in robotic manipulation. This addresses a key limitation in current VLA models where capabilities are bounded by training data.

major comments (3)

[Abstract] Abstract: The abstract states that evaluations were performed across simulation and real-world tasks but supplies no quantitative results, success rates, baselines, or error analysis. This absence makes it impossible to assess whether the VLM-guided flywheel reliably produces successful and automatically labelable demonstrations.
[Abstract] Abstract and flywheel description: The central claim that primitive steerability enables autonomous acquisition without human demonstrations rests on the unverified assumption that the VLM can propose low-level controls yielding trajectories that are both task-successful and correctly segmented by the same pipeline; no attempt-success fractions, failure-mode analysis, or autonomous success-scoring procedure are reported.
[Method (segmentation pipeline)] Segmentation pipeline description: No evidence or analysis is provided on whether the automated segmentation (VLM plan decomposition plus end-effector poses) remains accurate on novel VLM-generated trajectories rather than the original human demonstrations, which is load-bearing for the flywheel to close without external supervision.
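The attempt-success fractions requested in the major comments could be reported per primitive with an uncertainty interval, which matters at the small trial counts typical of robot experiments. The Python sketch below is one possible reporting format, not something taken from the manuscript. The log format (a list of `(primitive, success)` pairs) and the function names are assumptions, and the Wilson score interval is a standard choice swapped in here for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tally(log: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, int]]:
    """Count (successes, attempts) per primitive label."""
    counts = defaultdict(lambda: [0, 0])
    for primitive, success in log:
        counts[primitive][1] += 1
        if success:
            counts[primitive][0] += 1
    return {p: (s, n) for p, (s, n) in counts.items()}


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts ** 2)) / denom
    return center - half, center + half


# Hypothetical attempt log for two primitives.
log = [
    ("tilt bottle forward to pour", True),
    ("tilt bottle forward to pour", False),
    ("twist open the cap", True),
]
counts = tally(log)
low, high = wilson_interval(0, 25)  # zero successes in 25 attempts
```

Even zero successes in 25 attempts leaves an upper bound of roughly 13% on the true success rate, so attempt counts need to be reported alongside the fractions for the flywheel's reliability to be judged.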

minor comments (1)

The project website is referenced but the manuscript does not indicate whether it supplies videos, code, or additional quantitative results that would aid reproducibility assessment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract: The abstract states that evaluations were performed across simulation and real-world tasks but supplies no quantitative results, success rates, baselines, or error analysis. This absence makes it impossible to assess whether the VLM-guided flywheel reliably produces successful and automatically labelable demonstrations.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report representative success rates from the simulation and real-world evaluations (e.g., for pouring and sweeping) and note the baseline comparisons performed. The full quantitative results, baselines, and error analysis already appear in the experimental section; the abstract revision will simply surface these numbers at the summary level. revision: yes
Referee: [Abstract] Abstract and flywheel description: The central claim that primitive steerability enables autonomous acquisition without human demonstrations rests on the unverified assumption that the VLM can propose low-level controls yielding trajectories that are both task-successful and correctly segmented by the same pipeline; no attempt-success fractions, failure-mode analysis, or autonomous success-scoring procedure are reported.

Authors: The manuscript reports overall task success after flywheel integration but does not provide granular attempt-success fractions or a dedicated failure-mode breakdown for the autonomous generation step. We will add a concise subsection (or expanded paragraph) that reports these fractions, describes the observed failure modes, and clarifies the automatic success-scoring procedure used to accept demonstrations into the training set. This addition will directly support the central claim. revision: yes
Referee: [Method (segmentation pipeline)] Segmentation pipeline description: No evidence or analysis is provided on whether the automated segmentation (VLM plan decomposition plus end-effector poses) remains accurate on novel VLM-generated trajectories rather than the original human demonstrations, which is load-bearing for the flywheel to close without external supervision.

Authors: We acknowledge that the current manuscript validates the segmentation pipeline primarily on the initial human demonstrations and does not include an explicit accuracy comparison on VLM-generated trajectories. This is a substantive gap for the self-supervised claim. We will add targeted analysis (either quantitative metrics or qualitative examples) demonstrating segmentation performance on the autonomously generated trajectories, thereby confirming that the flywheel can operate without external labeling. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical framework relying on external VLM and VLA components whose performance is treated as given inputs rather than quantities derived within the work. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior author work appear in the provided text. The central claim rests on experimental outcomes across tasks rather than any self-referential reduction of outputs to inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the framework rests on the domain assumption that VLMs can accurately decompose plans and propose usable low-level controls; no free parameters or invented entities are described.

axioms (1)

domain assumption VLMs can reliably decompose demonstrations into labeled primitives and propose low-level controls that succeed often enough for the flywheel to improve the VLA.
Invoked in the description of both the segmentation pipeline and the data flywheel stages.

Reference graph

Works this paper leans on

68 extracted references · 1 canonical work pages

[1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An Open-Source Vision-Language-Action Model, June 2024. URL http://arxiv.org/abs/2406.09246. arXiv:2406.09246 [cs]
[2] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V... (2025)
[3] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A. Zhan... (2025)
[4] NASA’s InSight Waits Out Dust Storm - NASA, Oct. 2022. URL https://www.nasa.gov/missions/insight/nasas-insight-waits-out-dust-storm/. Section: InSight (Interior Exploration using Seismic Investigations, Geodesy and Heat Transport)
[5] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018. URL https://arxiv.org/abs/1806.10293
[6] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning, June 2025. URL https://arxiv.org/abs/2506.15799v2
[7] Z. Gu, M. Yang, D. Zou, and D. Xu. Learning Diffusion Policy from Primitive Skills for Robot Manipulation, Jan. 2026. URL http://arxiv.org/abs/2601.01948. arXiv:2601.01948 [cs]
[8] C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez. Integrated Task and Motion Planning, Oct. 2020. URL http://arxiv.org/abs/2010.01083. arXiv:2010.01083 [cs.RO]
[9] W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control, Feb. 2026. URL http://arxiv.org/abs/2602.13193. arXiv:2602.13193 [cs]
[10] L. Smith, A. Irpan, M. G. Arenas, S. Kirmani, D. Kalashnikov, D. Shah, and T. Xiao. STEER: Flexible Robotic Manipulation via Dense Language Grounding. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16517–16524, May 2025. doi: 10.1109/ICRA55743.2025.11127404. URL https://ieeexplore.ieee.org/document/11127404/
[11] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as Policies: Language Model Programs for Embodied Control, May 2023. URL http://arxiv.org/abs/2209.07753. arXiv:2209.07753 [cs]
[12] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Re... URL https://arxiv.org/abs/2204.01691v2
[13] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, Nov. 2023. URL http://arxiv.org/abs/2307.05973. arXiv:2307.05973 [cs.RO]
[14] M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, J. Wu, S. Sastry, Y. Zhu, K. Goldberg, and L. J. Fan. CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation, Mar. 2026. URL http://arxiv.org/abs/2603.22435. arXiv:2603.22435 [cs]
[15] P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, V. Choudhary, F. Collins, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, M. Dhaka, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Habeeb, H. Hancock, K. Hausman, G. H... (2026)
[16] S. Liu, I. S. Singh, Y. Xu, J. Duan, and R. Krishna. VLS: Steering Pretrained Robot Policies via Vision-Language Models, Feb. 2026. URL http://arxiv.org/abs/2602.03973. arXiv:2602.03973 [cs]
[17] N. B. Gutierrez, J. M. Cloud, and W. J. Beksi. Movement primitives in robotics: A comprehensive survey, 2026. URL https://arxiv.org/abs/2601.02379
[18] B. Lee, Y. Lee, S. Kim, M. Son, and F. C. Park. Equivariant Motion Manifold Primitives. In Proceedings of The 7th Conference on Robot Learning, pages 1199–1221. PMLR, Dec. 2023. URL https://proceedings.mlr.press/v229/lee23a.html
[19] W. Liu, N. Nie, R. Zhang, J. Mao, and J. Wu. Learning Compositional Behaviors from Demonstration and Language, 2025. URL https://arxiv.org/abs/2505.21981. Version Number: 1
[20] Y. Zhu, P. Stone, and Y. Zhu. Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation, Jan. 2022. URL http://arxiv.org/abs/2109.13841. arXiv:2109.13841 [cs]
[21] A. Adeniji. Learning Representations for Unsupervised Skill Discovery. 2024. URL https://purl.stanford.edu/sb108vw6601
[22] R. Cathomen, M. Mittal, M. Vlastelica, and M. Hutter. Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors. 2025
[23] N. Nie, W. Huang, J. Mao, L. Fei-Fei, W. Liu, and J. Wu. Learning composable skills by discovering spatial and temporal structure with foundation models. In IEEE International Conference on Robotics and Automation (ICRA), 2026
[24] L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, Feb. 2025. URL https://arxiv.org/abs/2502.19417v2
[25] C. Xu, Q. Li, J. Luo, and S. Levine. RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning, Dec. 2024. URL https://arxiv.org/abs/2412.09858v1
[26] J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang. ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, Sept. 2025. URL http://arxiv.org/abs/2505.10911. arXiv:2505.10911 [cs]
[27] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar. Eureka: Human-Level Reward Design via Coding Large Language Models, Apr. 2024. URL http://arxiv.org/abs/2310.12931. arXiv:2310.12931 [cs]
[28] X. Zhao, C. Weber, and S. Wermter. Agentic Skill Discovery, Aug. 2024. URL http://arxiv.org/abs/2405.15019. arXiv:2405.15019 [cs]
[29] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations. 2023
[30] J. Duan, W. Yuan, W. Pumacay, Y. R. Wang, K. Ehsani, D. Fox, and R. Krishna. Manipulate-Anything: Automating Real-World Robots using Vision-Language Models, Aug. 2024. URL http://arxiv.org/abs/2406.18915. arXiv:2406.18915 [cs.RO]
[31] H. Ha, P. Florence, and S. Song. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, Oct. 2023. URL http://arxiv.org/abs/2307.14535. arXiv:2307.14535 [cs]
[32] S. Cheng, Z. Li, K. Yu, and D. Xu. Continual Robot Learning via Language-Guided Skill Acquisition. 2025
[33] Y. Wu, G. Wang, Z. Yang, M. Yao, B. Sheil, and H. Wang. Continually Evolving Skill Knowledge in Vision Language Action Model, 2025. URL https://arxiv.org/abs/2511.18085. Version Number: 2
[34] X. Wang, Z. Han, Z. Liu, G. Li, J. Dong, B. Liu, L. Liu, and Z. Han. Lifelong Language-Conditioned Robotic Manipulation Learning, Mar. 2026. URL http://arxiv.org/abs/2603.05160. arXiv:2603.05160 [cs.RO]
[35] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
[36] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning, Oct. 2023. URL http://arxiv.org/abs/2306.03310. arXiv:2306.03310 [cs]
[37] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Jan. 2018. URL https://arxiv.org/abs/1801.01290v2
[38] S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning, Sept. 2025. URL https://arxiv.org/abs/2509.15937v1
[39] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. J. Fan, and J. Jang. World Action M... (2026)
[40] B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y. Ze, T. Harada, P. Torr, O. Mees, M. Pollefeys, Z. Liu, J. Wu, P. Abbeel, J. Malik, Y. Du, and J. Yang. World Model for Robot Learning: A Comprehensive Survey, Apr. 2026. URL http://arxiv.org/abs/2605.00080. arXiv:2605.00080 [cs]
[41] Gemini Team, Google DeepMind. Gemini 3: Advancing multimodal intelligence, agentic workflows, and deep reasoning. Technical report, Google DeepMind, 2025. URL https://deepmind.google/technologies/gemini.

A Implementation Details

We use the π0.5 VLA [2] in our experiments, although INSIGHT is agnostic to the underlying VLA. We fine-tune with LoRA [35] (G...
Break the goal into fine-grained steps. Use existing primitives for every sub-step they cover -- a skill gap should only be the novel part, not a bundle of existing + novel actions
Only create a skill gap when the desired outcome is fundamentally different from what any existing primitive produces. If an existing primitive could achieve the same result (even if executed differently), use it and put execution details in step_notes instead
Every step goes in primitive_sequence -- including new ones
New primitives also go in skill_gaps (must appear in BOTH lists)
Name new primitives by their desired EFFECT, not the robot motion
For each step, add a note on execution (approach, grasp, how it enables the next step)
After the final step, the runtime returns the gripper to a safe home pose, so the gripper does not need to be cleared from the workspace by a final step in the plan. Each step should make a distinguishable contribution to the goal -- avoid adding a final step whose only effect is repositioning the gripper
Each skill gap is one single-axis motion (one translation OR one rotation along one axis, in one direction). If the goal involves multiple distinct motions, create a separate skill gap for each. Example 1 -- pick and place (all existing, no skill gaps): primitive_sequence: ["move gripper to the red lego block", "close gripper", "lift upward", "move grippe...
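The bookkeeping rules in the prompt excerpt above (every step appears in `primitive_sequence`, new primitives also appear in `skill_gaps`, and a skill gap is never an existing primitive) are mechanical enough to check in code. The Python sketch below is a hypothetical validator written for illustration. It assumes the VLM's plan is parsed into a dict whose `primitive_sequence` and `skill_gaps` values are lists of label strings, which may not match the actual output schema. The function name `validate_plan` and the example gap label are likewise assumptions.

```python
from typing import Dict, List, Set


def validate_plan(plan: Dict[str, List[str]], known: Set[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    seq = plan["primitive_sequence"]
    gaps = plan["skill_gaps"]
    errors = []
    for gap in gaps:
        if gap not in seq:
            # New primitives must appear in BOTH lists.
            errors.append(f"skill gap not in primitive_sequence: {gap}")
        if gap in known:
            # A skill gap should only be the novel part.
            errors.append(f"skill gap duplicates a known primitive: {gap}")
    for step in seq:
        if step not in known and step not in gaps:
            errors.append(f"unknown step missing from skill_gaps: {step}")
    return errors


known = {"move gripper to the red lego block", "close gripper", "lift upward"}

good_plan = {
    "primitive_sequence": [
        "move gripper to the red lego block",
        "close gripper",
        "lift upward",
        "flip the block over",  # hypothetical new primitive, named by its effect
    ],
    "skill_gaps": ["flip the block over"],
}

bad_plan = {
    "primitive_sequence": ["close gripper", "flip the block over"],
    "skill_gaps": ["lift upward"],
}

good_errors = validate_plan(good_plan, known)
bad_errors = validate_plan(bad_plan, known)
```

A check like this only covers the structural rules. Whether a proposed gap is truly "fundamentally different" from existing primitives, or is a single-axis motion, still depends on the VLM's judgment.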
Never select drz for any motion that requires an object to tip over, invert, or pivot its top towards a target; drz only spins the object on its own axis
The wrist camera moves with the gripper; its local axes are independent of the global room frame. Never select an axis based on where a target appears to sit (left, right, up, down) in IMAGE 1. Map the required tilt strictly to the local structure of the gripper fingers in IMAGE 2. BE AWARE: Depth and gripper biases may exist due to the close-up wrist vi...
[55]

KNOWN move gripper above the yellow bottle cap -- Move the gripper into a top-down approach position centered over the yellow cap
[56]

KNOWN close gripper -- Close the gripper to secure a firm grasp on the cap
[57]

KNOWN twist open the cap -- Perform a 180-degree counterclockwise rotation to unscrew the cap from the bottle
[58]

KNOWN lift upward -- Lift the cap vertically to ensure it is completely detached from the bottle threads
[59]

KNOWN open gripper -- Open the gripper to drop the detached cap onto the workspace
[60]

KNOWN return to home -- Execute the mandatory hardware reset to return the robot to its canonical home pose
[61]

KNOWN move gripper to the side of the yellow bottle body -- Move the gripper to a side-approach position relative to the bottle body
[62]

KNOWN close gripper -- Close the gripper to perform a side grasp on the now-uncapped bottle
[63]

KNOWN lift upward -- Lift the bottle upward to clear the table for movement
[64]

KNOWN move gripper to the side of the bowl -- Transport the bottle to the side of the bowl in preparation for pouring
[65]

KNOWN tilt bottle forward to pour -- Tilt the bottle forward over the bowl to empty its contents
[66]

KNOWN tilt bottle back upright -- Rotate the bottle back to a vertical, upright orientation
[67]

KNOWN lower gripper -- Lower the bottle back down to the table surface
[68]

KNOWN open gripper -- Open the gripper to release the bottle.
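
The planning rules quoted in the prompt excerpts (every step goes in primitive_sequence, new primitives must also appear in skill_gaps, existing primitives are reused rather than re-declared) can be illustrated with a small consistency check. This is a minimal sketch, not the authors' implementation: the field names primitive_sequence, skill_gaps, and step_notes come from the prompt text, while the `Plan` class, `validate_plan`, `tag_steps`, and the example library state are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    # Field names follow the prompt excerpts; the class itself is hypothetical.
    primitive_sequence: list
    skill_gaps: list = field(default_factory=list)
    step_notes: list = field(default_factory=list)


def validate_plan(plan, known_primitives):
    """Return a list of rule violations; an empty list means the plan passes."""
    errors = []
    # Rule: new primitives must appear in BOTH lists.
    for gap in plan.skill_gaps:
        if gap not in plan.primitive_sequence:
            errors.append(f"skill gap not in primitive_sequence: {gap!r}")
    # Rule: a step that no existing primitive covers must be declared as a gap.
    for step in plan.primitive_sequence:
        if step not in known_primitives and step not in plan.skill_gaps:
            errors.append(f"unknown step not declared as skill gap: {step!r}")
    # Rule: an existing primitive should be reused, not re-declared as a gap.
    for gap in plan.skill_gaps:
        if gap in known_primitives:
            errors.append(f"existing primitive listed as skill gap: {gap!r}")
    # Rule: if notes are given, there is one execution note per step.
    if plan.step_notes and len(plan.step_notes) != len(plan.primitive_sequence):
        errors.append("step_notes length does not match primitive_sequence")
    return errors


def tag_steps(plan, known_primitives):
    """Label each step KNOWN or NEW, in the style of the plan listing above."""
    return [
        ("KNOWN" if step in known_primitives else "NEW", step)
        for step in plan.primitive_sequence
    ]


# Assumed library state before the twisting skill has been acquired.
known = {
    "move gripper above the yellow bottle cap",
    "close gripper",
    "lift upward",
    "open gripper",
}

plan = Plan(
    primitive_sequence=[
        "move gripper above the yellow bottle cap",
        "close gripper",
        "twist open the cap",
        "lift upward",
        "open gripper",
    ],
    skill_gaps=["twist open the cap"],
)

print(validate_plan(plan, known))  # []
print(tag_steps(plan, known)[2])   # ('NEW', 'twist open the cap')
```

Once "twist open the cap" has been acquired and added to the known set, the same plan would tag every step as KNOWN, matching the listing above.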

[1]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An Open-Source Vision-Language-Action Model, June 2024. URL http://arxiv.org/abs/2406.09246. arXiv:2406.09246 [cs]

[3]

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V... 2025.

[4]

J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A. Zhan... 2025.

[5]

NASA’s InSight Waits Out Dust Storm - NASA, Oct. 2022. URL https://www.nasa.gov/missions/insight/nasas-insight-waits-out-dust-storm/. Section: InSight (Interior Exploration using Seismic Investigations, Geodesy and Heat Transport)

[6]

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018. URL https://arxiv.org/abs/1806.10293

[7]

A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering Your Diffusion Policy with Latent Space Reinforcement Learning, June 2025. URL https://arxiv.org/abs/2506.15799v2

[9]

Z. Gu, M. Yang, D. Zou, and D. Xu. Learning Diffusion Policy from Primitive Skills for Robot Manipulation, Jan. 2026. URL http://arxiv.org/abs/2601.01948. arXiv:2601.01948 [cs]

[10]

C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez. Integrated Task and Motion Planning, Oct. 2020. URL http://arxiv.org/abs/2010.01083. arXiv:2010.01083 [cs.RO]

[11]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control, Feb. 2026. URL http://arxiv.org/abs/2602.13193. arXiv:2602.13193 [cs]

[12]

L. Smith, A. Irpan, M. G. Arenas, S. Kirmani, D. Kalashnikov, D. Shah, and T. Xiao. STEER: Flexible Robotic Manipulation via Dense Language Grounding. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16517–16524, May 2025. doi: 10.1109/ICRA55743.2025.11127404. URL https://ieeexplore.ieee.org/document/11127404/

[13]

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as Policies: Language Model Programs for Embodied Control, May 2023. URL http://arxiv.org/abs/2209.07753. arXiv:2209.07753 [cs]

[14]

M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Re... URL https://arxiv.org/abs/2204.01691v2

[16]

W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, Nov. 2023. URL http://arxiv.org/abs/2307.05973. arXiv:2307.05973 [cs.RO]

[17]

M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, J. Wu, S. Sastry, Y. Zhu, K. Goldberg, and L. J. Fan. CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation, Mar. 2026. URL http://arxiv.org/abs/2603.22435. arXiv:2603.22435 [cs]

[18]

P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, V. Choudhary, F. Collins, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, M. Dhaka, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Habeeb, H. Hancock, K. Hausman, G. H... 2026.

[19]

S. Liu, I. S. Singh, Y. Xu, J. Duan, and R. Krishna. VLS: Steering Pretrained Robot Policies via Vision-Language Models, Feb. 2026. URL http://arxiv.org/abs/2602.03973. arXiv:2602.03973 [cs]

[20]

N. B. Gutierrez, J. M. Cloud, and W. J. Beksi. Movement primitives in robotics: A comprehensive survey, 2026. URL https://arxiv.org/abs/2601.02379

[21]

B. Lee, Y. Lee, S. Kim, M. Son, and F. C. Park. Equivariant Motion Manifold Primitives. In Proceedings of The 7th Conference on Robot Learning, pages 1199–1221. PMLR, Dec. 2023. URL https://proceedings.mlr.press/v229/lee23a.html

[22]

W. Liu, N. Nie, R. Zhang, J. Mao, and J. Wu. Learning Compositional Behaviors from Demonstration and Language, 2025. URL https://arxiv.org/abs/2505.21981. Version Number: 1

[23]

Y. Zhu, P. Stone, and Y. Zhu. Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation, Jan. 2022. URL http://arxiv.org/abs/2109.13841. arXiv:2109.13841 [cs]

[24]

A. Adeniji. Learning Representations for Unsupervised Skill Discovery. 2024. URL https://purl.stanford.edu/sb108vw6601

[25]

R. Cathomen, M. Mittal, M. Vlastelica, and M. Hutter. Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors. 2025

[26]

N. Nie, W. Huang, J. Mao, L. Fei-Fei, W. Liu, and J. Wu. Learning composable skills by discovering spatial and temporal structure with foundation models. In IEEE International Conference on Robotics and Automation (ICRA), 2026

[27]

L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, Feb. 2025. URL https://arxiv.org/abs/2502.19417v2

[28]

C. Xu, Q. Li, J. Luo, and S. Levine. RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning, Dec. 2024. URL https://arxiv.org/abs/2412.09858v1

[29]

J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang. ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, Sept. 2025. URL http://arxiv.org/abs/2505.10911. arXiv:2505.10911 [cs]

[30]

Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar. Eureka: Human-Level Reward Design via Coding Large Language Models, Apr. 2024. URL http://arxiv.org/abs/2310.12931. arXiv:2310.12931 [cs]

[31]

X. Zhao, C. Weber, and S. Wermter. Agentic Skill Discovery, Aug. 2024. URL http://arxiv.org/abs/2405.15019. arXiv:2405.15019 [cs]

[32]

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations. 2023

[33]

J. Duan, W. Yuan, W. Pumacay, Y. R. Wang, K. Ehsani, D. Fox, and R. Krishna. Manipulate-Anything: Automating Real-World Robots using Vision-Language Models, Aug. 2024. URL http://arxiv.org/abs/2406.18915. arXiv:2406.18915 [cs.RO]

[34]

H. Ha, P. Florence, and S. Song. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, Oct. 2023. URL http://arxiv.org/abs/2307.14535. arXiv:2307.14535 [cs]

[35]

S. Cheng, Z. Li, K. Yu, and D. Xu. Continual Robot Learning via Language-Guided Skill Acquisition. 2025

[36]

Y. Wu, G. Wang, Z. Yang, M. Yao, B. Sheil, and H. Wang. Continually Evolving Skill Knowledge in Vision Language Action Model, 2025. URL https://arxiv.org/abs/2511.18085. Version Number: 2

[37]

X. Wang, Z. Han, Z. Liu, G. Li, J. Dong, B. Liu, L. Liu, and Z. Han. Lifelong Language-Conditioned Robotic Manipulation Learning, Mar. 2026. URL http://arxiv.org/abs/2603.05160. arXiv:2603.05160 [cs.RO]

[38]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

[39]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning, Oct. 2023. URL http://arxiv.org/abs/2306.03310. arXiv:2306.03310 [cs]

[40]

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Jan. 2018. URL https://arxiv.org/abs/1801.01290v2

[41]

S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning, Sept. 2025. URL https://arxiv.org/abs/2509.15937v1

[42]

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. J. Fan, and J. Jang. World Action M... 2026.

[43]

B. Hou, G. Li, J. Jia, T. An, X. Guo, S. Leng, H. Geng, Y. Ze, T. Harada, P. Torr, O. Mees, M. Pollefeys, Z. Liu, J. Wu, P. Abbeel, J. Malik, Y. Du, and J. Yang. World Model for Robot Learning: A Comprehensive Survey, Apr. 2026. URL http://arxiv.org/abs/2605.00080. arXiv:2605.00080 [cs]

[44]

Gemini Team, Google DeepMind. Gemini 3: Advancing multimodal intelligence, agentic workflows, and deep reasoning. Technical report, Google DeepMind, 2025. URL https://deepmind.google/technologies/gemini.

A Implementation Details

We use the π0.5 VLA [2] in our experiments, although INSIGHT is agnostic to the underlying VLA. We fine-tune with LoRA [35] (G...

[45]

Break the goal into fine-grained steps. Use existing primitives for every sub-step they cover -- a skill gap should only be the novel part, not a bundle of existing + novel actions

[46]

Only create a skill gap when the desired outcome is fundamentally different from what any existing primitive produces. If an existing primitive could achieve the same result (even if executed differently), use it and put execution details in step_notes instead

[47]

Every step goes in primitive_sequence -- including new ones

[48]

New primitives also go in skill_gaps (must appear in BOTH lists)

[49]

Name new primitives by their desired EFFECT, not the robot motion

[50]

For each step, add a note on execution (approach, grasp, how it enables the next step)
