
arxiv: 2606.12366 · v1 · pith:2NNFWYX6new · submitted 2026-06-10 · 💻 cs.RO

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Pith reviewed 2026-06-27 09:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models, action expert pretraining, instruction generalization, vision-action prior, gated fusion, out-of-distribution instructions, policy factorization

The pith

Pretraining the action expert on vision-action pairs before language integration improves generalization of vision-language-action policies to unseen instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that structural imbalance in vision-language-action data, where language is far less diverse than visuals and actions, leads policies to rely on visual shortcuts and generalize poorly to new instructions. It addresses this by factorizing the policy into a language-agnostic vision-action prior and a language-conditioned likelihood. The first stage pretrains the action expert solely on vision-action pairs while the vision-language model stays frozen, bypassing the language imbalance. The second stage injects language through gated fusion, integrating VLM features while preserving the visuomotor prior. The paper reports that this two-stage process improves results on out-of-distribution instructions and compositional tasks.

Core claim

APT factorizes the policy into a language-agnostic vision-action prior and a language-conditioned VLA likelihood; the action expert is pretrained as the prior on vision-action pairs from a frozen VLM, after which language tokens are injected via gated fusion to produce the full policy without corrupting the prior or the VLM's language capability.
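The factorization can be written out with Bayes' rule. The symbols below are our own notation, not necessarily the paper's: $v$ is the visual observation, $\ell$ the language instruction, and $a$ the action. The normalizer $p(\ell \mid v)$ does not depend on the action, which is why the policy is proportional to the product of the two factors.

```latex
\pi(a \mid v, \ell)
  = \frac{p(a \mid v)\, p(\ell \mid v, a)}{p(\ell \mid v)}
  \;\propto\;
  \underbrace{p(a \mid v)}_{\text{language-agnostic VA prior}}
  \cdot
  \underbrace{p(\ell \mid v, a)}_{\text{language-conditioned likelihood}}
```

Read this way, Stage 1 fits the first factor without ever seeing language, and the language-dependent term only has to reweight actions the prior already considers plausible.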

What carries the argument

The two-stage training process that pretrains the action expert as a vision-action prior on frozen-VLM pairs, followed by gated fusion to incorporate language tokens.
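The summary does not spell out how the gate is built, so the following is only a minimal sketch of one common design: a residual update scaled by a zero-initialized tanh gate, as used in Flamingo-style gated cross-attention. The function name `gated_fusion`, the scalar gate, and the feature values are all hypothetical and chosen for illustration; the paper's mechanism may differ.

```python
import math


def gated_fusion(action_feat, lang_feat, gate_param):
    """Add language features to action-expert features through a scalar tanh gate.

    Hypothetical helper: with gate_param = 0 the gate is closed and the
    pretrained (vision-action) features pass through unchanged.
    """
    if len(action_feat) != len(lang_feat):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(gate_param)  # lies in (-1, 1); exactly 0 when gate_param is 0
    return [a + gate * l for a, l in zip(action_feat, lang_feat)]


prior_features = [0.2, -1.0, 0.5]      # made-up action-expert features
language_features = [1.0, 1.0, -2.0]   # made-up projected language features

closed = gated_fusion(prior_features, language_features, gate_param=0.0)
opened = gated_fusion(prior_features, language_features, gate_param=0.5)
```

With the gate at zero, `closed` equals `prior_features`, which is the sense in which such a gate can preserve a pretrained prior at the start of the second stage. As the gate parameter moves away from zero, language features are blended in gradually.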

If this is right

  • APT produces consistent gains on unseen instructions.
  • APT improves performance on compositional tasks.
  • The method applies to mainstream VLA architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pretraining approach could lower the amount of diverse language data required to train capable VLA policies.
  • Similar modality-specific pretraining might help other multimodal control settings where one input type dominates the data distribution.
  • Evaluating the learned prior on physical robots with novel verbal commands would test whether the gains transfer beyond simulation.

Load-bearing premise

Pretraining the action expert solely on vision-action pairs produces a language-agnostic prior that the gated fusion stage can integrate without losing the VLM's language capability or introducing new imbalances.

What would settle it

An ablation that removes the pretraining stage, trains the action expert from random initialization on the full imbalanced dataset, and still matches or exceeds APT's generalization to unseen instructions would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.12366 by Anzhe Chen, Kechun Xu, Rong Xiong, Yue Wang, Zhenjie Zhu.

Figure 1: Action expert pretraining (APT) enables effective instruction following. Enabling robots to follow diverse task instructions across varied environments is a long-standing goal of generalizable robot policies. Vision-Language-Action (VLA) models have emerged as a promising paradigm toward this goal, leveraging pretrained Vision-Language Models (VLMs) to ground language instructions in visual observations …
Figure 2: Overview of APT. In Stage 1, the action expert is pretrained as a VA prior conditioned …
Figure 3: Action Expert Design. VLM features are in …
Figure 4: Action expert pretraining applies to diverse architectures.
Figure 5: Ablation on large-scale pretraining and language injection mechanism. Effectiveness of Large-Scale Pretraining. To disentangle large-scale pretraining, we evaluate a variant applying two-stage training only on task-specific datasets, i.e., without any pretraining on large-scale datasets (w/o Pretraining). As shown in …
Figure 7: Real-world cases. (a) pick-place task, (b) clutter pick-place …
Figure 8: Action representation. We use relative end-effector poses projected in the camera frame, …
Figure 9: Attention masks of the two-stage training. Stage 1 uses PRoPE layers over vision and …
Figure 10: Overview of the simulation benchmarks. […] a different object in the same scene, forming an unseen task for OOD language generalization. Notably, simply replicating training trajectories fails under both perturbations, preventing dataset-level shortcuts. Rigid Object Pick-Place. Following [6], we evaluate on a simulation benchmark built in IsaacSim [78] for diverse language-conditioned pick-and-place tasks u…
Figure 11: Qualitative case studies on rigid object pick-place across the four evaluation settings.
Figure 12: Real-world platform, containers, objects, and background variations used in our real …
Figure 13: Real-world task and generalization setting overview. Top: single pick-place task with …
Figure 14: More real-world cases comparing π0.5 and APT. (a, b) Grasping on unseen objects. (c, d) Sub-task transition in clutter pick-place. (e, f) Compositional task execution. Red dotted lines visualize end-effector trajectories; annotated text highlights the failure cause for π0.5.
Figure 15: Failure breakdown on the UOUE setting of the clutter pick-place task. The Sankey …
Figure 16: Typical failure cases of APT. Top: continued pushing after successfully grasping the tar…
Original abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $\pi$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes APT, a two-stage training method for Vision-Language-Action (VLA) policies that factorizes the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood. In Stage 1 the action expert is pretrained on vision-action pairs from a frozen VLM; in Stage 2 language tokens are injected via a gated fusion mechanism. The approach is claimed to apply to mainstream architectures including π and GR00T-style models and to deliver consistent gains on unseen instructions and compositional tasks by mitigating language imbalance in the training data.

Significance. If the experimental gains hold, the method supplies a practical Bayesian-motivated factorization that isolates visuomotor pretraining from language conditioning, potentially allowing better exploitation of pretrained VLMs without noisy gradients corrupting language capability. This addresses a recognized structural imbalance in continuous-action VLA data and could improve OOD instruction generalization across existing architectures.

major comments (1)
  1. [Abstract] Abstract: the central claim of 'consistent gains on unseen instructions and compositional tasks' rests on 'comprehensive experiments,' yet the manuscript provides no information on dataset sizes, baseline comparisons, error bars, ablation controls, or statistical significance, rendering it impossible to verify whether the reported improvements are supported by the data or affected by post-hoc choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. The concern about insufficient experimental transparency in the abstract is valid, and we address it directly below while committing to revisions that improve verifiability without altering the core claims.

  1. Referee: [Abstract] Abstract: the central claim of 'consistent gains on unseen instructions and compositional tasks' rests on 'comprehensive experiments,' yet the manuscript provides no information on dataset sizes, baseline comparisons, error bars, ablation controls, or statistical significance, rendering it impossible to verify whether the reported improvements are supported by the data or affected by post-hoc choices.

    Authors: We agree the abstract is overly concise and omits key experimental metadata, which limits immediate verification. The manuscript body details the datasets (including sizes and sources), baseline architectures and implementations, ablation studies isolating the pretraining and fusion components, and results aggregated over multiple random seeds. We will revise the abstract to explicitly state dataset sizes, number of baselines evaluated, presence of ablations, and that reported gains include error bars. If statistical significance tests (e.g., paired t-tests) are absent from the current results, we will add them in the revision to strengthen the evidence for consistent OOD gains. revision: yes
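For concreteness, the paired t-test mentioned in the rebuttal could be computed as in the sketch below. The per-seed success rates are hypothetical placeholders, not results from the paper, and `paired_t_statistic` is an illustrative helper rather than anything the authors describe. Only the Python standard library is used; a p-value would additionally need a Student-t distribution (for example from a statistics package).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for a paired t-test over matched runs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired run-for-run")
    if len(method_scores) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


# Hypothetical success rates on unseen instructions, one entry per random seed.
with_pretraining = [0.5, 0.6, 0.7]
without_pretraining = [0.4, 0.4, 0.4]

t_value, dof = paired_t_statistic(with_pretraining, without_pretraining)
```

Pairing by seed matters here: it removes seed-to-seed variation that is shared by both methods, so a consistent but small gain is not drowned out by run-level noise.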

Circularity Check

0 steps flagged

No significant circularity identified


The paper describes a two-stage training procedure (pretrain action expert on vision-action pairs with frozen VLM, then gated language fusion) as an independent methodological fix for data imbalance. No equations, fitted parameters renamed as predictions, or self-citations appear as load-bearing elements in the derivation chain. The Bayesian factorization is presented as a modeling choice rather than a derived result that reduces to its inputs. The approach is self-contained against external benchmarks via described experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract provides only high-level description; the Bayesian factorization and gated fusion are introduced without quantitative details on any fitted values or additional assumptions.

axioms (1)
  • (domain assumption) The policy distribution can be factorized into a language-agnostic vision-action prior and a language-conditioned VLA likelihood.
    Invoked to motivate the two-stage training; stated directly in the abstract.
invented entities (1)
  • gated fusion mechanism (no independent evidence)
    Purpose: integrates VLM language features into the pretrained action expert while preserving the visuomotor prior.
    New component introduced in Stage 2; no independent evidence provided in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Action Priors for Cross-embodiment Robot Manipulation

    cs.RO · 2026-06 · unverdicted · novelty 5.0

    A two-stage framework pretrains an action module with temporal motion priors from unconditioned trajectories using flow-matching, then transfers it to VLA training via decoder reuse and distillation, yielding better p...

Reference graph

Works this paper leans on

80 extracted references · 30 linked inside Pith · cited by 1 Pith paper

  1. [1]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  3. [3]

    G. A. Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/preview-uqlxvb-bb.html

  4. [4]

    X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025

  5. [5]

    J. Gao, S. Belkhale, S. Dasari, A. Balakrishna, D. Shah, and D. Sadigh. A taxonomy for evaluating generalist robot policies. arXiv preprint arXiv:2503.01238, 2025

  6. [6]

    K. Xu, Z. Zhu, A. Chen, S. Zhao, Q. Huang, Y. Yang, H. Lu, R. Xiong, M. Tomizuka, and Y. Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy. arXiv preprint arXiv:2512.11218, 2025

  7. [7]

    Y. Fang, Y. Feng, D. Jing, J. Liu, Y. Yang, Z. Wei, D. Szafir, and M. Ding. When vision overrides language: Evaluating and mitigating counterfactual failures in vlas. arXiv preprint arXiv:2602.17659, 2026

  8. [8]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023

  9. [9]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), pages 2679–2713, 2024

  10. [10]

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  11. [11]

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  12. [12]

    Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1702–1713, 2025

  13. [13]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations (ICLR), 2025

  14. [14]

    J. Wen, Y. Zhu, M. Zhu, Z. Tang, J. Li, Z. Zhou, X. Liu, C. Shen, Y. Peng, and F. Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In International Conference on Machine Learning (ICML), pages 66558–66574. PMLR, 2025

  15. [15]

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  16. [16]

    D. Driess, J. T. Springenberg, L. Yu, A. Li-Bell, K. Pertsch, A. Z. Ren, H. Walke, Q. Vuong, L. X. Shi, S. Levine, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  17. [17]

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1.5: An improved open foundation model for generalist humanoid robots. 2025

  18. [18]

    Y. Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y. Wang, S. Guo, T. Guan, K. N. Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025

  19. [19]

    X. Li, P. Li, L. Qian, M. Liu, D. Wang, J. Liu, B. Kang, X. Ma, X. Wang, D. Guo, et al. What matters in building vision-language-action models for generalist robots. Nature Machine Intelligence, pages 1–15, 2026

  20. [20]

    C. Cui, P. Ding, W. Song, S. Bai, X. Tong, Z. Ge, R. Suo, W. Zhou, Y. Liu, B. Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation. arXiv preprint arXiv:2505.03912, 2025

  21. [21]

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. Robotics: Science and Systems (RSS), 2023

  22. [22]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  23. [23]

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. Robotics: Science and Systems (RSS), 2023

  24. [24]

    Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan. Vima: Robot manipulation with multimodal prompts. In International Conference on Machine Learning (ICML), pages 14975–15022, 2023

  25. [25]

    O. X.-E. Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023

  26. [26]

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (RSS), 2024

  27. [27]

    H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In IEEE International Conference on Robotics and Automation (ICRA), pages 653–660. IEEE, 2024

  28. [28]

    K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y. Zhao, Z. Xu, G. Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

  29. [29]

    Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  30. [30]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  31. [31]

    H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y.-J. Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025

  32. [32]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  33. [33]

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024

  34. [34]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  35. [35]

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023

  36. [36]

    M. Zhu, Y. Zhu, J. Li, J. Wen, Z. Xu, N. Liu, R. Cheng, C. Shen, Y. Peng, F. Feng, et al. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation. In IEEE International Conference on Robotics and Automation (ICRA), pages 10838–10845. IEEE, 2025

  37. [37]

    J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  38. [38]

    Q. Bu, H. Li, L. Chen, J. Cai, J. Zeng, H. Cui, M. Yao, and Y. Qiao. Towards synergistic, generalized, and efficient dual-system for robotic manipulation. arXiv preprint arXiv:2410.08001, 2024

  39. [39]

    M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  40. [40]

    Y. Jiang, S. Huang, S. Xue, Y. Zhao, J. Cen, S. Leng, K. Li, J. Guo, K. Wang, M. Chen, et al. Rynnvla-001: Using human demonstrations to improve robot manipulation. arXiv preprint arXiv:2509.15212, 2025

  41. [41]

    W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

  42. [42]

    J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  43. [43]

    W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  44. [44]

    J. Sun, W. Zhang, Z. Qi, S. Ren, Z. Liu, H. Zhu, G. Sun, X. Jin, and Z. Chen. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  45. [45]

    P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al. π0.7: a steerable generalist robotic foundation model with emergent capabilities. arXiv preprint arXiv:2604.15483, 2026

  46. [46]

    L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  47. [47]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems (NeurIPS), 36:44776–44791, 2023

  48. [48]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025

  49. [49]

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

  50. [50]

    J. Guo, Z. Wu, C. Tu, Y. Ma, X. Kong, Z. Liu, J. Ji, S. Zhang, Y. Chen, K. Chen, et al. On robustness of vision-language-action model against multi-modal perturbations. arXiv preprint arXiv:2510.00037, 2025

  51. [51]

    S. Orjuela et al. Robust skills, brittle grounding: Diagnosing restricted generalization in vision-language action policies via multi-object picking. arXiv preprint arXiv:2602.24143, 2026

  52. [52]

    S. L. Wanna, A. Luhtaru, R. Barron, J. Salfity, J. Moore, C. Matuszek, and M. Pryor. Let’s talk about language! investigating linguistic diversity in embodied ai datasets. In 1st Workshop on Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities

  53. [53]

    I. Fang, J. Zhang, S. Tong, and C. Feng. From intention to execution: Probing the generalization boundaries of vision-language-action models. arXiv preprint arXiv:2506.09930, 2025

  54. [54]

    S. Lian, B. Yu, X. Lin, L. T. Yang, Z. Shen, C. Wu, Y. Miao, C. Huang, and K. Chen. LangForce: Bayesian decomposition of vision language action models via latent action queries. arXiv preprint arXiv:2601.15197, 2026

  55. [55]

    S. Yang, H. Li, Y. Chen, B. Wang, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025

  56. [56]

    C. Cheang, S. Chen, Z. Cui, Y. Hu, L. Huang, T. Kong, H. Li, Y. Li, Y. Liu, X. Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025

  57. [57]

    H. Huang, F. Liu, L. Fu, T. Wu, M. Mukadam, J. Malik, K. Goldberg, and P. Abbeel. Otter: A vision-language-action model with text-aware visual feature extraction. In International Conference on Machine Learning (ICML)

  58. [58]

    M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning (CoRL), pages 4996–5013. PMLR, 2025

  59. [59]

    Y. Wu, R. Tian, G. Swamy, and A. Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment. arXiv preprint arXiv:2502.01828, 2025

  60. [60]

    Y. Zhang, C. Wang, O. Lu, Y. Zhao, Y. Ge, Z. Sun, X. Li, C. Zhang, C. Bai, and X. Li. Align-then-steer: Adapting the vision-language action models through unified latent guidance. arXiv preprint arXiv:2509.02055, 2025

  61. [61]

    Z. Zhan, Y. Chen, J. Zhou, Q. Lv, H. Liu, K. Wang, L. Lin, and G. Wang. Stable language guidance for vision-language-action models. arXiv preprint arXiv:2601.04052, 2026

  62. [62]

    E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 32, pages 3942–3951, 2018

  63. [63]

    M. Fu, J. Yu, K. El-Refai, E. Kou, H. Xue, H. Huang, W. Xiao, G. Wang, F.-F. Li, G. Shi, et al. Cap-x: A framework for benchmarking and improving coding agents for robot manipulation. arXiv preprint arXiv:2603.22435, 2026

  64. [64]

    Z. Xu, Z. He, J. Wu, and S. Song. Learning 3d dynamic scene representations for robot manipulation. In Conference on Robot Learning (CoRL), pages 126–142. PMLR, 2021

  65. [65]

    Z. Chen, Q. Yan, Y. Chen, T. Wu, J. Zhang, Z. Ding, J. Li, Y. Yang, and H. Dong. ClutterDexGrasp: A Sim-to-Real system for general dexterous grasping in cluttered scenes. In Conference on Robot Learning (CoRL), pages 885–905. PMLR, 2025

  66. [66]

    Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5738–5746, 2019

  67. [67]

    A. Chen, Y. Yang, Z. Zhu, K. Xu, Z. Zhou, R. Xiong, and Y. Wang. Toward embodiment equivariant vision-language-action policy. arXiv preprint arXiv:2509.14630, 2025

  68. [68]

    R. Li, B. Yi, J. Liu, H. Gao, Y. Ma, and A. Kanazawa. Cameras as relative positional encoding. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  69. [69]

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  70. [70]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020

  71. [71]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations

  72. [72]

    Y. Tian, Y. Yang, Y. Xie, Z. Cai, X. Shi, N. Gao, H. Liu, X. Jiang, Z. Qiu, F. Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 976–985, 2026

  73. [73]

    X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, et al. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025

  74. [74]

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In International Conference on Machine Learning (ICML), pages 23123–23144, 2024

  75. [75]

    Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

  76. [76]

    S. Tan, K. Dou, Y. Zhao, and P. Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025

  77. [77]

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

  78. [78]

    NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim

  79. [79]

    K. Gao, D. Lau, B. Huang, K. E. Bekris, and J. Yu. Fast high-quality tabletop rearrangement in bounded workspace. In IEEE International Conference on Robotics and Automation (ICRA), pages 1961–1967. IEEE, 2022

  80. [80]

    K. Xu, H. Yu, Q. Lai, Y. Wang, and R. Xiong. Efficient learning of goal-oriented push-grasping synergy in clutter. IEEE Robotics and Automation Letters, 6(4):6337–6344, 2021

A Implementation Details. Action Representation. Actions are defined on the SE(3) manifold: 3D translation, 6D continuous rotation [66], and normalized gripper width (−1 fully closed,...
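The 6D continuous rotation representation of [66] recovers a rotation matrix from two 3-vectors by Gram-Schmidt orthogonalization followed by a cross product. The sketch below illustrates that mapping in plain Python. The function name `rotation_6d_to_matrix` and the choice to treat the recovered vectors as matrix columns are assumptions made here for illustration; they are not taken from the paper's code.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_6d_to_matrix(r6):
    """Map a 6D rotation vector to a 3x3 rotation matrix (list of rows)."""
    if len(r6) != 6:
        raise ValueError("expected six numbers")
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)                               # first column
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])  # remove the b1 component
    b3 = _cross(b1, b2)                               # completes a right-handed frame
    columns = [b1, b2, b3]
    return [[columns[j][i] for j in range(3)] for i in range(3)]


R = rotation_6d_to_matrix([1.0, 2.0, 3.0, -1.0, 0.5, 2.0])
```

Because any pair of non-parallel vectors yields a valid rotation, a network can regress the six numbers directly without the discontinuities that Euler angles or quaternions introduce, which is the motivation given in [66].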