
arxiv: 2606.21496 · v1 · pith:WSQS5A6Onew · submitted 2026-06-19 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Decoupling the Declarative from the Procedural in Vision-Language-Action Models

Pith reviewed 2026-06-26 14:28 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV cs.LG
keywords vision-language-action models, zero-shot transfer, declarative knowledge, procedural knowledge, robot manipulation, modular architecture, behavior cloning, generalization

The pith

Restructuring information flow in vision-language-action models decouples declarative from procedural knowledge for zero-shot object transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language-action models entangle knowledge about object concepts with knowledge about how to perform actions, which limits their ability to apply learned skills to new objects. The paper introduces w²VLA, which changes the flow by modulating the robot state sequence with separate visual, spatial, and skill information instead of routing all tokens through one large transformer action expert. This modular restructuring is meant to separate the two knowledge types inside the model parameters. If correct, a policy trained on demonstrations with one set of objects could execute the same skills on dissimilar unseen objects without further training, cutting down the data needed for generalist robots.

Core claim

By modulating the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner rather than feeding all multimodal tokens from the VLM encoder into a large transformer-based action expert, the w²VLA model decouples declarative knowledge of concepts and entity semantics from procedural knowledge of how to perform actions, resulting in robust behavior cloning and zero-shot skill transfer to dissimilar unseen objects.

What carries the argument

The restructured information flow in w²VLA that modulates the robot state sequence with visual, spatial, and skill information in a compositional manner.
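
One way to picture this restructured flow is as a chain of conditioning steps applied to the robot state tokens. The sketch below is only an illustration. It assumes a FiLM-style scale-and-shift (the conditioning layer of reference [27]) as the modulation operator, which the abstract does not confirm. The function names, the toy two-dimensional tokens, and the visual, then where, then what ordering are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def film(tokens: Sequence[Vector], gamma: Vector, beta: Vector) -> List[Vector]:
    """Scale and shift every state token with one (gamma, beta) pair."""
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


def modulate_state(
    state_tokens: Sequence[Vector],
    conditions: Sequence[Tuple[Vector, Vector]],
) -> List[Vector]:
    """Apply the conditioning signals one after another (assumed ordering)."""
    out = [list(tok) for tok in state_tokens]
    for gamma, beta in conditions:
        out = film(out, gamma, beta)
    return out


# Toy example: two robot state tokens with two channels each.
state = [[1.0, 2.0], [0.5, -1.0]]
visual = ([1.0, 1.0], [0.1, 0.0])  # hypothetical visual conditioning
where = ([2.0, 1.0], [0.0, 0.0])   # hypothetical spatial conditioning
what = ([1.0, -1.0], [0.0, 0.5])   # hypothetical skill conditioning

modulated = modulate_state(state, [visual, where, what])
print(modulated)
```

Because each signal enters through its own (gamma, beta) pair, the skill conditioning can be swapped while the spatial conditioning stays fixed, which is the kind of compositionality the core claim relies on.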

If this is right

  • Policies achieve robust behavior cloning from object-specific demonstrations.
  • Zero-shot skill transfer becomes possible across dissimilar unseen objects.
  • The model handles spatial, semantic, and task variations more reliably than standard VLAs.
  • Knowledge representations become more compositional and interpretable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of information types could reduce the volume of robot demonstration data needed for new tasks.
  • Explicit modulation of state sequences might generalize to other sequential decision systems that combine perception and action.
  • Testing whether internal activations show clearer separation between semantic and motor features would directly probe the claimed decoupling.
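
The probing idea in the last bullet can be sketched with a very small decoder. This is a minimal illustration, not an analysis from the paper: it assumes activations are available as plain feature vectors, swaps in a leave-one-out nearest-centroid classifier as the probe, and uses made-up toy features and labels. All names are hypothetical.

```python
from collections import defaultdict


def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(features, labels):
    """Leave-one-out nearest-centroid accuracy for decoding labels."""
    correct = 0
    for i, (feat, label) in enumerate(zip(features, labels)):
        groups = defaultdict(list)
        for j, (other, other_label) in enumerate(zip(features, labels)):
            if j != i:  # hold out the sample being scored
                groups[other_label].append(other)
        cents = {name: centroid(rows) for name, rows in groups.items()}
        pred = min(cents, key=lambda name: sq_dist(feat, cents[name]))
        correct += int(pred == label)
    return correct / len(features)


# Made-up activations from a hypothetical "procedural" pathway.
features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
skills = ["push", "push", "pick", "pick"]
objects = ["carrot", "banana", "carrot", "banana"]

skill_acc = probe_accuracy(features, skills)
object_acc = probe_accuracy(features, objects)
print(skill_acc, object_acc)
```

In this toy data the skill label is decodable from the features and the object label is not, which is the pattern a decoupled procedural pathway would be expected to show. Real activations would need a held-out split and a stronger probe.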

Load-bearing premise

That modulating the robot state sequence with separate visual, spatial, and skill information will cause the learned parameters to separate declarative knowledge from procedural knowledge.

What would settle it

Train the model on behavior demonstrations with one set of objects, then test whether the same skills succeed at high rates on a set of dissimilar unseen objects with no additional training or fine-tuning.
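
A test of this kind needs a held-out split of (skill, object) combinations. The sketch below is a hedged illustration based on the Figure 1 example, where training covers (rotate 90°, carrot) and (place 5cm back, banana) and evaluation uses (rotate 90°, banana). The helper names are hypothetical, and the rollout outcomes are placeholders rather than results from the paper.

```python
from itertools import product


def transfer_pairs(train_pairs):
    """Unseen (skill, object) combinations built from seen skills and objects."""
    seen = set(train_pairs)
    skills = sorted({skill for skill, _ in seen})
    objects = sorted({obj for _, obj in seen})
    return [pair for pair in product(skills, objects) if pair not in seen]


def success_rate(outcomes):
    """Fraction of successful rollouts; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


train = [("rotate 90 degrees", "carrot"), ("place 5cm back", "banana")]
held_out = transfer_pairs(train)
print(held_out)

# Placeholder outcomes for one held-out pair; not data from the paper.
placeholder_rollouts = [True, False, True, True]
print(success_rate(placeholder_rollouts))
```

This split only recombines skills and objects that each appear somewhere in training. Objects that never appear in training at all would need to be listed separately.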

Figures

Figures reproduced from arXiv: 2606.21496 by Alexandros Kouris, Andreas Sochopoulos, Chris Xiaoxuan Lu, Nikolaos Tsagkas, Oisin Mac Aodha.

Figure 1: Skill transfer example. (a) Three VLAs (i.e., π0.5, OTTER, and our w²VLA) are trained on a dataset of two (skill, object) pairs: (rotate 90°, carrot) and (place 5cm back, banana). (b) We evaluate the three VLAs in a skill transfer scenario (rotate 90°, banana). Unlike our w²VLA, π0.5 and OTTER fail to transfer the learned skill to the other object. (c) Summary of our main experimental results: all thr…
Figure 2: We compare two popular VLA paradigms against our w…
Figure 3: w²VLA architecture. Our policy sequentially modulates a sequence of hidden robot state tokens with the decoupled where (i.e., location of interest) and what (i.e., skill to be executed) information of the task. Visual tokens are extracted from a VFM encoder to inform the hidden states of relevant visual cues in the scene. Where (spatial) tokens are computed via attention heatmaps that localize the object …
Figure 4: Visualization of all (skill, object) pairs (detailed information in Appendix A.2).
… object’s appearance and instructed task, successfully decoupling the spatial (where) information from the remaining components of skill execution. Skill Conditioning (i.e., the what): To condition the policy on the behavior intent of the task (e.g., “pick”, “push”), we utilize the skill embedding e_skill. As the semantic goal…
Figure 5: Breakdown of average performance scores: (a) from OTTER, …
Original abstract

Deploying generalist robotic agents in the real world requires transferable skills. Specifically, a policy trained to clone a behavior from object-specific demonstrations must generalize beyond that object, otherwise data collection requirements become intractable. Recently, fine-tuning of pre-trained billion-parameter Vision-Language Models (VLMs), initially on large-scale robot datasets and then on fewer scenario-specific demonstrations, has emerged as the predominant paradigm for designing Vision-Language-Action (VLA) models. While these policies achieve state-of-the-art manipulation performance in-distribution, they remain brittle to minor spatial, semantic, and task variations. In this work, we address the inability of current models to decouple the declarative (i.e., concepts and entity semantics) from the procedural knowledge (i.e., how to do something) encoded in their parameters, which is a fundamental bottleneck for zero-shot skill transfer to novel objects. To address this, we propose w$^{2}$VLA, a new VLA model with restructured information flow. Rather than feeding all multimodal tokens from the VLM encoder into a large, opaque transformer-based action expert, our approach modulates the robot state sequence with visual, spatial, and skill information in a compositional and interpretable manner. Unlike popular, state-of-the-art VLAs, we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer capabilities across dissimilar, unseen objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes w²VLA, a Vision-Language-Action model that restructures information flow by modulating the robot state sequence with visual, spatial, and skill information in a compositional manner, rather than feeding all multimodal tokens into a large transformer action expert. It claims this modular design decouples declarative (entity semantics) from procedural (action execution) knowledge in the parameters, enabling robust behavior cloning and unprecedented zero-shot skill transfer to dissimilar unseen objects.

Significance. If the decoupling claim and transfer results hold with supporting evidence, the work would address a central limitation of current VLA models (brittleness to object variations) and could reduce data collection needs for new scenarios. The interpretable modulation approach offers a potential alternative to opaque fine-tuning of billion-parameter VLMs.

major comments (2)
  1. [Abstract] Abstract: The manuscript asserts empirical success ('we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer') but supplies no experimental setup, datasets, metrics, baselines, ablations, or quantitative results. The central claim therefore cannot be evaluated.
  2. [Abstract] Abstract: The design choice of modulating the robot state sequence is presented as sufficient to produce decoupling of declarative and procedural knowledge in the learned parameters, yet no diagnostic evidence (e.g., representation probing, weight analysis, or controlled transfer metrics) is referenced to substantiate that the modulation actually enforces the claimed separation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on the abstract and the evidence for our decoupling claim. We address each point below and indicate where revisions to the manuscript are appropriate.

  1. Referee: [Abstract] Abstract: The manuscript asserts empirical success ('we show that our modular approach successfully decouples knowledge representations, enabling robust behavior cloning and unprecedented zero-shot skill transfer') but supplies no experimental setup, datasets, metrics, baselines, ablations, or quantitative results. The central claim therefore cannot be evaluated.

    Authors: The abstract is a concise summary; the full manuscript provides the requested details. Section 4 describes the datasets (including robot demonstration collections for training and held-out novel objects), evaluation metrics (success rate for behavior cloning and zero-shot transfer), baselines (standard VLA fine-tuning approaches), and ablations on the modulation components. Quantitative results appear in Section 5 with tables and figures. We will revise the abstract to include a brief clause referencing the evaluation protocol for improved self-containment. revision: yes

  2. Referee: [Abstract] Abstract: The design choice of modulating the robot state sequence is presented as sufficient to produce decoupling of declarative and procedural knowledge in the learned parameters, yet no diagnostic evidence (e.g., representation probing, weight analysis, or controlled transfer metrics) is referenced to substantiate that the modulation actually enforces the claimed separation.

    Authors: The zero-shot transfer results to dissimilar unseen objects serve as the primary controlled evidence: success rates remain high only when declarative and procedural components are separated via modulation, while non-modular baselines fail. This is quantified in the transfer experiments. We acknowledge that explicit representation probing or weight analysis is not included; if the referee believes these would strengthen the claim, we can add them as additional diagnostics in a revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected


The manuscript presents an architectural proposal for a new VLA model (w²VLA) that restructures information flow by modulating robot state sequences rather than feeding all multimodal tokens into a transformer action expert. No equations, loss functions, fitted parameters, or derivation chains appear in the abstract or description that could reduce a claimed prediction or result to an input by construction. The central claim concerns empirical decoupling via a modular design, with no self-citation load-bearing steps, uniqueness theorems, or ansatzes invoked to justify the architecture. This is a standard case of a self-contained design proposal without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no equations, training details, or parameter counts supplied. No free parameters, axioms, or invented entities can be extracted beyond the high-level model name.

invented entities (1)
  • w²VLA modular information flow (no independent evidence)
    purpose: decouple declarative from procedural knowledge in VLA parameters
    New architecture introduced in the abstract without prior independent evidence or falsifiable prediction outside the paper.

pith-pipeline@v0.9.1-grok · 5810 in / 1111 out tokens · 22296 ms · 2026-06-26T14:28:26.874287+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 13 linked inside Pith

  1. [1]

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv:2407.07726, 2024

  2. [2]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191, 2024

  3. [3]

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In International Conference on Robotics and Automation (ICRA), 2024

  4. [4]

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics, 2024

  5. [5]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2025

  6. [6]

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734, 2025

  7. [7]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. In Robotics: Science and S...

  8. [8]

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al. π0.5: a vision-language-action model with open-world generalization. In Conference on Robot Learning (CoRL), 2025

  9. [9]

    Physical Intelligence. π*0.6: a vla that learns from experience. arXiv:2511.14759, 2025

  10. [10]

    M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv:2506.01844, 2025

  11. [11]

    A. Goyal, H. Hadfield, X. Yang, V. Blukis, and F. Ramos. Vla-0: Building state-of-the-art vlas with zero modification. arXiv:2510.13054, 2025

  12. [12]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [13]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv:2506.18088, 2025

  14. [14]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv:2510.13626, 2025

  15. [15]

    X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv:2510.03827, 2025

  16. [16]

    H. Huang, F. Liu, L. Fu, T. Wu, M. Mukadam, J. Malik, K. Goldberg, and P. Abbeel. Otter: A vision-language-action model with text-aware feature extraction. In International Conference on Machine Learning (ICML), 2025

  17. [17]

    X. Yang, R. Dagli, A. Zook, H. Hadfield, A. Goyal, S. Birchfield, F. Ramos, and J. Tremblay. Robolab: A high-fidelity simulation benchmark for analysis of task generalist policies. arXiv:2604.09860, 2026

  18. [18]

    S. Grover, A. Gopalkrishnan, B. Ai, H. I. Christensen, H. Su, and X. Li. Enhancing generalization in vision-language-action models by preserving pretrained representations. arXiv:2509.11417, 2025

  19. [19]

    Y.-S. Chuang, Y. Li, D. Wang, C.-F. Yeh, K. Lyu, R. Raghavendra, J. Glass, L. Huang, J. Weston, L. Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. Advances in Neural Information Processing Systems (NeurIPS), 2026

  20. [20]

    M. A. Goodale and A. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 1992

  21. [21]

    M. A. Goodale, A. D. Milner, L. S. Jakobson, and D. P. Carey. A neurological dissociation between perceiving objects and grasping them. Nature, 1991

  22. [22]

    D. Milner and M. Goodale. The visual brain in action, volume 27. Oxford University Press, 2006

  23. [23]

    M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning (CoRL), 2021

  24. [24]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  25. [25]

    N. Tsagkas, A. Sochopoulos, D. Danier, S. Vijayakumar, A. Kouris, O. Mac Aodha, and C. X. Lu. Attentive feature aggregation or: How policies learn to stop worrying about robustness and attend to task-relevant visual cues. arXiv:2511.10762, 2025

  26. [26]

    L. Fu, H. Huang, G. Datta, L. Y. Chen, W. C.-H. Panitch, F. Liu, H. Li, and K. Goldberg. In-context imitation learning via next-token prediction. In International Conference on Robotics and Automation (ICRA), 2025

  27. [27]

    E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. Film: visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence (AAAI), 2018

  28. [28]

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems (RSS), 2023

  29. [29]

    R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning (ICML), 2020

  30. [30]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Computer Vision and Pattern Recognition (CVPR), 2022

  31. [31]

    R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/huggingface/lerobot, 2024

  32. [32]

    A. Xie, L. Lee, T. Xiao, and C. Finn. Decomposing the generalization gap in imitation learning for visual robotic manipulation. In International Conference on Robotics and Automation (ICRA), 2024

  33. [33]

    N. Hansen, Z. Yuan, Y. Ze, T. Mu, A. Rajeswaran, H. Su, H. Xu, and X. Wang. On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. In International Conference on Machine Learning (ICML), 2023

  34. [34]

    K. Burns, Z. Witzel, J. I. Hamid, T. Yu, C. Finn, and K. Hausman. What makes pre-trained visual representations successful for robust manipulation? In Conference on Robot Learning (CoRL), 2024

  35. [35]

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (ICML), 2019

  36. [36]

    X. Lin, J. So, S. Mahalingam, F. Liu, and P. Abbeel. Spawnnet: Learning generalizable visuomotor skills from pre-trained network. In International Conference on Robotics and Automation (ICRA), 2024

  37. [37]

    M. Torne, K. Pertsch, H. Walke, K. Vedder, S. Nair, B. Ichter, A. Z. Ren, H. Wang, J. Tang, K. Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv:2603.03596, 2026

  38. [38]

    L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. In International Conference on Machine Learning (ICML), 2025

  39. [39]

    W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control. arXiv:2602.13193, 2026

  40. [40]

    H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. Transactions on Robotics (T-RO), 2023

  41. [41]

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023

  42. [42]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In International Conference on Computer Vision (ICCV), 2023

  43. [43]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021

  44. [44]

    M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang. Clearclip: Decomposing clip representations for dense vision-language inference. In European Conference on Computer Vision (ECCV), 2024

  45. [45]

    P. Sundaresan, S. Belkhale, D. Sadigh, and J. Bohg. KITE: Keypoint-conditioned policies for semantic manipulation. In Conference on Robot Learning (CoRL), 2023

  46. [46]

    J. Zhang, M. Memmel, K. Kim, D. Fox, J. Thomason, F. Ramos, E. Bıyık, A. Gupta, and A. Li. Peek: Guiding and minimal image representations for zero-shot generalization of robot manipulation policies. In International Conference on Robotics and Automation (ICRA), 2026

  47. [47]

    W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. In Conference on Robot Learning (CoRL), 2023

  48. [48]

    W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. In Conference on Robot Learning (CoRL), 2023

  49. [49]

    N. Tsagkas, J. Rome, S. Ramamoorthy, O. Mac Aodha, and C. X. Lu. Click to grasp: Zero-shot precise manipulation via visual diffusion descriptors. In International Conference on Intelligent Robots and Systems (IROS), 2024

  50. [50]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), 2023

  51. [51]

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems (RSS), 2024

  52. [52]

    D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. Advances in Neural Information Processing Systems (NeurIPS), 2026

  53. [53]

    X. Guo, B. Xie, W. Chai, X. Deng, T. Wang, Z. Wu, and X. Chen. Priorvla: Prior-preserving adaptation for vision-language-action models. arXiv:2605.10925, 2026

  54. [54]

    S. Chen, P. Pacaud, and C. Schmid. Pointact: Vision-language-action models with multi-scale point-action interaction. arXiv:2605.21414, 2026

  55. [55]

    A. Sochopoulos, N. Malkin, N. Tsagkas, J. Moura, M. Gienger, and S. Vijayakumar. Fast flow-based visuomotor policies via conditional optimal transport couplings. In Conference on Robot Learning (CoRL), 2025

  56. [56]

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019