pith. sign in

arxiv: 2602.18532 · v2 · pith:HEPKW5KCnew · submitted 2026-02-20 · 💻 cs.CV · cs.AI· cs.RO

VLANeXt: Recipes for Building Strong VLA Models

Pith reviewed 2026-05-21 12:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO
keywords Vision-Language-Action modelsVLA design spacerobot policy learningLIBERO benchmarkunified evaluation frameworkaction modellingperception essentials
0
0 comments X

The pith

A controlled sweep of design choices in vision-language-action models yields a recipe that produces a stronger model called VLANeXt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to reduce confusion in the VLA field by holding training and evaluation conditions fixed while changing one design element at a time. It begins with a plain baseline similar to the earliest VLA models and then varies choices in three areas: which foundation model to start from, how perception is handled, and how actions are turned into outputs. Twelve practical findings emerge from this comparison and combine into VLANeXt. A sympathetic reader would care because a shared recipe could replace scattered trial-and-error with clearer steps toward reliable robot policies that work across many tasks.

Core claim

Under one shared training protocol and evaluation setup, systematic variation of foundational components, perception essentials, and action modelling perspectives produces twelve concrete findings. When these findings are combined, the resulting VLANeXt model surpasses prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and also succeeds in real-world robot experiments.

What carries the argument

The three-dimensional dissection of the VLA design space (foundational components, perception essentials, action modelling perspectives) performed inside a single fixed training and evaluation framework.

Load-bearing premise

That holding training protocols and test conditions constant is enough to reveal the true contribution of each design choice without leftover biases from implementation details or untested interactions.

What would settle it

Retraining VLANeXt from the released code on the same benchmarks but with a different random seed or data split and finding that it no longer outperforms the previous best models would challenge the claim that the twelve findings form a robust recipe.

Figures

Figures reproduced from arXiv: 2602.18532 by Bin Fan, Chen Change Loy, Jian-Jian Jiang, Kang Liao, Runze Yang, Wei-Shi Zheng, Xiao-Ming Wu, Yihang Luo, Zhonghua Wu.

Figure 1
Figure 1. Figure 1: Performance comparison on the LIBERO and LIBERO-plus benchmarks. We compare VLANeXt with repre￾sentative VLA baselines across model scales. Despite its smaller model size, VLANeXt achieves higher success rates than prior methods on both standard task performance (LIBERO) and robust￾ness/generalization (LIBERO-plus), demonstrating the effective￾ness of the design recipe distilled in this work. Vision–Langua… view at source ↗
Figure 2
Figure 2. Figure 2: Ablation trajectory across the VLA design space (spatial suite). We progressively evolve a baseline VLA through changes in foundational components, perception, and action mod￾eling. Results are reported on LIBERO initially, and on LIBERO￾plus once LIBERO performance saturates, providing a more sensi￾tive test of robustness and generalization. The trajectory culminates in the final VLANeXt model (2.5B) vs. … view at source ↗
Figure 4
Figure 4. Figure 4: Design choices for the VLM-Policy connection. Action Chunking. Our baseline predicts actions one step at a time. Here, we evaluate action chunking, which pre￾dicts multiple future actions jointly and is known to improve inference efficiency (Kim et al., 2025). Results show that longer chunk horizons consistently improve action gener￾ation performance ( [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Design choices for proprioception conditioning. (b) world modelling image future image generator meta query MLLM text, image actions noise flows of actions diffusion (a) normal flows of actions diffusion meta query MLLM text, image noise actions [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Augmenting action prediction with an auxiliary world modeling objective. action tokens text tokens image tokens z x y roll yaw hist. proprioceptions instruction multi-view frames action projector text embedding vision encoder meta query (soft) Put the black objects into the draw. MLLM pitch Frequency (Hz) z x y roll yaw future actions pitch diffusion noise flows of actions [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
Figure 7
Figure 7. Figure 7: VLANeXt architecture. Multi-view visual inputs, lan￾guage instructions, and proprioception are tokenized and processed by a multimodal LLM, with meta queries enabling soft interaction with the policy module. Action chunks are predicted using flow matching and further regularized by a frequency-domain objective. diction (Zhou et al., 2022; Yi et al., 2023; Yang et al., 2024; Wang et al., 2025a), we introduc… view at source ↗
Figure 8
Figure 8. Figure 8: Our single-arm and bimanual arm tasks for the real-world experiments. world settings. In addition, even without bimanual training, our method can adapt to bimanual robotics tasks with decent performance, demonstrating the cross-embodiment adapt￾ability of the method. Additional video demonstrations of our experimental results are provided in the supplementary materials. 5. Conclusion This work moves toward… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative experiments of our method in real-world tasks. (a) Spatial (pick up the black bowl between the plate and the ramekin and place it on the plate) (b) Object (pick up the tomato sauce and place it in the basket) (c) Goal (open the middle drawer of the cabinet) (d) Long (put both the alphabet soup and the tomato sauce in the basket) [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative experiments of our method in the four suites of the LIBERO benchmark. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative experiments of our method in the 7 types of perturbations in the same task in the LIBERO-plus benchmark. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
read the original abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic exploration of the Vision-Language-Action (VLA) design space under a single unified training and evaluation protocol. Starting from an RT-2-like baseline, the authors ablate choices across three dimensions—foundational components, perception essentials, and action modelling—distilling 12 findings that are combined into the VLANeXt model. The resulting model is reported to outperform prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks while also showing strong real-world performance; the authors release a codebase to support reproducibility.

Significance. If the ablations successfully isolate the contribution of each design choice, the work could provide a useful empirical recipe for the VLA community and reduce fragmentation in the literature. The release of a unified codebase is a concrete positive for reproducibility and future extensions. The significance is tempered by the need to confirm that reported gains are not inflated by untested cross-dimensional interactions or residual implementation biases.

major comments (2)
  1. [§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.
  2. [Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.
minor comments (2)
  1. [§3.2] §3.2 (Perception Essentials): The notation for the visual encoder variants is introduced without an explicit summary table mapping each variant to its architectural differences; adding such a table would improve readability.
  2. [Figure 5] Figure 5 (Real-world Rollouts): The caption does not specify the number of trials or success criteria used for the reported success rates; this detail should be added for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental design and commit to specific revisions that strengthen the claims regarding the validity of the distilled recipe and the fairness of baseline comparisons.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.

    Authors: We agree that the current ablations vary one dimension at a time while holding the others at baseline values, which isolates marginal contributions but does not explicitly test for cross-dimensional interactions. This sequential approach is standard for distilling independent design insights, yet we recognize that unexamined synergies could affect the separability claim. In the revised manuscript we will add a dedicated cross-term ablation subsection that combines representative changes from the perception and action-modeling axes (e.g., best perception module paired with best action head) and report the resulting performance deltas relative to the additive expectation. revision: yes

  2. Referee: [Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.

    Authors: All reported baselines were re-implemented from the original papers and trained from scratch inside the same unified training and evaluation harness, using identical optimizer settings, data splits, batch sizes, and random seeds. To make this explicit, we will expand the experimental-setup section and add an appendix table that lists, for each baseline, the source code reference, exact hyperparameter values used, and confirmation that training was performed under the VLANeXt protocol rather than relying on previously published numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical VLA ablation study

full rationale

The paper's chain begins with an RT-2-like baseline and proceeds via systematic empirical ablations across three design dimensions in a unified framework, distilling 12 findings that are then assembled into VLANeXt. Performance is measured on external benchmarks (LIBERO, LIBERO-plus) and real-world tests rather than being derived from fitted parameters or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing way that would make outputs equivalent to inputs by construction. The unified protocol and codebase release further support independent reproducibility outside the paper's own runs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical ablation study; it introduces no new mathematical axioms, free parameters fitted to target results, or invented physical entities. It relies on standard assumptions that benchmark tasks measure policy quality and that controlled variation isolates design effects.

axioms (1)
  • domain assumption Design choices along foundational components, perception essentials, and action modelling can be varied independently inside one shared training and evaluation protocol.
    Invoked when the paper states it systematically dissects the design space under a unified framework.

pith-pipeline@v0.9.0 · 5790 in / 1247 out tokens · 62111 ms · 2026-05-21T12:53:02.803599+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

    cs.LG 2026-05 conditional novelty 7.0

    Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

  2. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  3. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 24 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y ., Chen, R., Chen, K., et al. Qwen3-vl techni- cal report.arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Gao, C., and Shou, M. Z. Evolve-vla: Test-time training from e...

  2. [2]

    3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks

    Bhat, V ., Lan, Y .-H., Krishnamurthy, P., Karri, R., and Khor- rami, F. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800,

  3. [3]

    Motus: A Unified Latent Action World Model

    Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y ., Xiang, C., Rong, Y ., et al. Mo- tus: A unified latent action world model.arXiv preprint arXiv:2512.13030,

  4. [4]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Casta˜neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

  6. [6]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a

    Cen, J., Huang, S., Yuan, Y ., Li, K., Yuan, H., Yu, C., Jiang, Y ., Guo, J., Li, X., Luo, H., et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a. Cen, J., Yu, C., Yuan, H., Jiang, Y ., Huang, S., Guo, J., Li, X., Song, Y ., Luo, H., Wang, F., et al. Worldvla: To- wards autoregressive action world m...

  7. [7]

    Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a

    Chen, P., Bu, P., Wang, Y ., Wang, X., Wang, Z., Guo, J., Zhao, Y ., Zhu, Q., Song, J., Yang, S., et al. Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a. Chen, Z., Niu, R., Kong, H., Wang, Q., Xing, Q., and Fan, Z. Tgrpo: Fine-tuning vision-language-action model v...

  8. [8]

    Cui, Y ., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y ., Luo, Z., Wang, J., Wang, W., et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

  9. [9]

    Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

    Ding, P., Ma, J., Tong, X., Zou, B., Luo, X., Fan, Y ., Wang, T., Lu, H., Mo, P., Liu, J., et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795,

  10. [10]

    Srpo: Self-referential policy optimization for vision-language-action models.arXiv preprint arXiv:2511.15605, 2025

    Fei, S., Wang, S., Ji, L., Li, A., Zhang, S., Liu, L., Hou, J., Gong, J., Zhao, X., and Qiu, X. Srpo: Self-referential 9 VLANeXt: Recipes for Building Strong VLA Models policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025a. Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., e...

  11. [11]

    Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

    Goyal, A., Hadfield, H., Yang, X., Blukis, V ., and Ramos, F. Vla-0: Building state-of-the-art vlas with zero modifica- tion.arXiv preprint arXiv:2510.13054,

  12. [12]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  13. [13]

    Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search

    Guo, W., Lu, G., Deng, H., Wu, Z., Tang, Y ., and Wang, Z. Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643,

  14. [14]

    Training Large Language Models to Reason in a Continuous Latent Space

    Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,

  15. [15]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Huang, C.-P., Wu, Y .-H., Chen, M.-H., Wang, Y .-C. F., and Yang, F.-E. Thinkact: Vision-language-action reason- ing via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025a. Huang, J., Wang, S., Lin, F., Hu, Y ., Wen, C., and Gao, Y . Tactile-vla: unlocking vision-language-action model’s physical knowledge for tactile generalization.a...

  16. [16]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Huang, W., Wang, C., Li, Y ., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. InCoRL, 2025c. Hung, C.-Y ., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al. Nora: A small open- sourced generalist vision language action model for em- bodied tasks.arXiv pre...

  17. [17]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dha- balia, K., DiCarlo, J., et al. pi0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025a. Intelligence, P., Black, K., Brown, N., Darpinian, J., Dha- balia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al...

  18. [18]

    Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

    Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., and Nair, S. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

  19. [19]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model. InCoRL, 2024a. Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,

  20. [20]

    Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b

    Kim, Y ., Oh, H., Lee, J., Choi, J., Ji, G., Jung, M., Youm, D., and Hwangbo, J. Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b. Kuang, F., You, J., Hu, Y ., Zhang, T., Wen, C., and Gao, Y . Adapt your body: Mitigating proprioception shifts in imitation learning.arXiv preprint arXiv:2506.23944,

  21. [21]

    MolmoAct: Action Reasoning Models that can Reason in Space

    10 VLANeXt: Recipes for Building Strong VLA Models Lee, J., Duan, J., Fang, H., Deng, Y ., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y . R., Lee, S., et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,

  22. [22]

    Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a

    Li, H., Yang, S., Chen, Y ., Tian, Y ., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a. Li, H., Zuo, Y ., Yu, J., Zhang, Y ., Yang, Z., Zhang, K., Zhu, X., Zhang, Y ., Chen, T., Cui, G., et al. Simplevla-rl...

  23. [23]

    Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

    Liang, H., Chen, X., Wang, B., Chen, M., Liu, Y ., Zhang, Y ., Chen, Z., Yang, T., Chen, Y ., Pang, J., et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

  24. [24]

    VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Lu, G., Guo, W., Zhang, C., Zhou, Y ., Jiang, H., Gao, Z., Tang, Y ., and Wang, Z. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719,

  25. [25]

    F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    Lv, Q., Kong, W., Li, H., Zeng, J., Qiu, Z., Qu, D., Song, H., Chen, Q., Deng, X., and Pang, J. F1: A vision-language- action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951,

  26. [26]

    A Survey on Vision-Language-Action Models for Embodied AI

    Ma, Y ., Song, Z., Zhuang, Y ., Hao, J., and King, I. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093,

  27. [27]

    Transfer between Modalities with MetaQueries

    Pan, X., Shukla, S. N., Singh, A., Zhao, Z., Mishra, S. K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., Hou, J., and Xie, S. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256,

  28. [28]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

  29. [29]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Qu, D., Song, H., Chen, Q., Yao, Y ., Ye, X., Ding, Y ., Wang, Z., Gu, J., Zhao, B., Wang, D., et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,

  30. [30]

    E., Otto, F., and Lioutikov, R

    Reuss, M., Zhou, H., R ¨uhle, M., Ya ˘gmurlu, ¨O. E., Otto, F., and Lioutikov, R. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996,

  31. [31]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    11 VLANeXt: Recipes for Building Strong VLA Models Shi, H., Xie, B., Liu, Y ., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., and Huang, G. Memo- ryvla: Perceptual-cognitive memory in vision-language- action models for robotic manipulation.arXiv preprint arXiv:2508.19236,

  32. [32]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

  33. [33]

    Gemini Robotics: Bringing AI into the Physical World

    Team, G. R., Abeyruwan, S., Ainslie, J., Alayrac, J.-B., Arenas, M. G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

  34. [34]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic under- standing, localization, and dense features.arXiv preprint arXiv:2502.14786,

  35. [35]

    End-to-end Listen, Look, Speak and Act

    Wang, S., Yu, W., Chen, X., Tian, X., Zhang, J., Lu, L., and Zhang, C. End-to-end listen, look, speak and act.arXiv preprint arXiv:2510.16756, 2025b. Wang, Y ., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509....

  36. [36]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Xiao, J., Yang, Y ., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.-S., and Zhang, Q. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025a. Xiao, L., Li, J., Gao, J., Ye, F., Jin, Y ., Qian, J., Zhang, J., Wu, Y ., and Yu, X. Ava-vla: Improving vision-language- action models with activ...

  37. [37]

    4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a

    Zhang, J., Chen, Y ., Xu, Y ., Huang, Z., Zhou, Y ., Yuan, Y .-J., Cai, X., Huang, G., Quan, X., Xu, H., et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a. 12 VLANeXt: Recipes for Building Strong VLA Models Zhang, J., Guo, Y ., Hu, Y ., Chen, X., Zhu, X., and Chen, J. Up-vl...

  38. [38]

    Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge

    Zhang, W., Liu, H., Qi, Z., Wang, Y ., Yu, X., Zhang, J., Dong, R., He, J., Lu, F., Wang, H., et al. Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge. InNeurIPS, 2025c. Zhang, Z., Zheng, K., Chen, Z., Jang, J., Li, Y ., Han, S., Wang, C., Ding, M., Fox, D., and Yao, H. Grape: Gener- alizing robot policy via preference ...

  39. [39]

    Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

    Zhong, Z., Yan, H., Li, J., Liu, X., Gong, X., Zhang, T., Song, W., Chen, J., Zheng, X., Wang, H., et al. Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

  40. [40]

    More Experimental Results A.1

    13 VLANeXt: Recipes for Building Strong VLA Models A. More Experimental Results A.1. Qualitative Experiments We present more demos of our model on the LIBERO and LIBERO-plus benchmarks, as well as in real-world settings (see Figures 10, 11, and 9). Additional video demonstrations of our experimental results are provided in the supplementary materials. (a)...

  41. [41]

    primordial soup

    to enhance action generation, and designing post-training optimization like planning or reinforcement learning to adapt to specific environment (Guo et al., 2025; Zhang et al., 2025d; Bai et al., 2025c; Tan et al., 2025; Li et al., 2025b; Fei et al., 2025a; Chen et al., 2025b; Huang et al., 2025a; Lu et al., 2025; Xiao et al., 2025a;b). Additionally, a su...