VLANeXt: Recipes for Building Strong VLA Models

Bin Fan; Chen Change Loy; Jian-Jian Jiang; Kang Liao; Runze Yang; Wei-Shi Zheng; Xiao-Ming Wu; Yihang Luo; Zhonghua Wu

arxiv: 2602.18532 · v2 · pith:HEPKW5KCnew · submitted 2026-02-20 · 💻 cs.CV · cs.AI· cs.RO

VLANeXt: Recipes for Building Strong VLA Models

Xiao-Ming Wu , Bin Fan , Kang Liao , Jian-Jian Jiang , Runze Yang , Yihang Luo , Zhonghua Wu , Wei-Shi Zheng

show 1 more author

Chen Change Loy

This is my paper

Pith reviewed 2026-05-21 12:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO

keywords Vision-Language-Action modelsVLA design spacerobot policy learningLIBERO benchmarkunified evaluation frameworkaction modellingperception essentials

0 comments

The pith

A controlled sweep of design choices in vision-language-action models yields a recipe that produces a stronger model called VLANeXt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to reduce confusion in the VLA field by holding training and evaluation conditions fixed while changing one design element at a time. It begins with a plain baseline similar to the earliest VLA models and then varies choices in three areas: which foundation model to start from, how perception is handled, and how actions are turned into outputs. Twelve practical findings emerge from this comparison and combine into VLANeXt. A sympathetic reader would care because a shared recipe could replace scattered trial-and-error with clearer steps toward reliable robot policies that work across many tasks.

Core claim

Under one shared training protocol and evaluation setup, systematic variation of foundational components, perception essentials, and action modelling perspectives produces twelve concrete findings. When these findings are combined, the resulting VLANeXt model surpasses prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and also succeeds in real-world robot experiments.

What carries the argument

The three-dimensional dissection of the VLA design space (foundational components, perception essentials, action modelling perspectives) performed inside a single fixed training and evaluation framework.

Load-bearing premise

That holding training protocols and test conditions constant is enough to reveal the true contribution of each design choice without leftover biases from implementation details or untested interactions.

What would settle it

Retraining VLANeXt from the released code on the same benchmarks but with a different random seed or data split and finding that it no longer outperforms the previous best models would challenge the claim that the twelve findings form a robust recipe.

Figures

Figures reproduced from arXiv: 2602.18532 by Bin Fan, Chen Change Loy, Jian-Jian Jiang, Kang Liao, Runze Yang, Wei-Shi Zheng, Xiao-Ming Wu, Yihang Luo, Zhonghua Wu.

**Figure 1.** Figure 1: Performance comparison on the LIBERO and LIBERO-plus benchmarks. We compare VLANeXt with representative VLA baselines across model scales. Despite its smaller model size, VLANeXt achieves higher success rates than prior methods on both standard task performance (LIBERO) and robustness/generalization (LIBERO-plus), demonstrating the effectiveness of the design recipe distilled in this work. Vision–Langua… view at source ↗

**Figure 2.** Figure 2: Ablation trajectory across the VLA design space (spatial suite). We progressively evolve a baseline VLA through changes in foundational components, perception, and action modeling. Results are reported on LIBERO initially, and on LIBEROplus once LIBERO performance saturates, providing a more sensitive test of robustness and generalization. The trajectory culminates in the final VLANeXt model (2.5B) vs. … view at source ↗

**Figure 4.** Figure 4: Design choices for the VLM-Policy connection. Action Chunking. Our baseline predicts actions one step at a time. Here, we evaluate action chunking, which predicts multiple future actions jointly and is known to improve inference efficiency (Kim et al., 2025). Results show that longer chunk horizons consistently improve action generation performance ( [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: Design choices for proprioception conditioning. (b) world modelling image future image generator meta query MLLM text, image actions noise flows of actions diffusion (a) normal flows of actions diffusion meta query MLLM text, image noise actions [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Augmenting action prediction with an auxiliary world modeling objective. action tokens text tokens image tokens z x y roll yaw hist. proprioceptions instruction multi-view frames action projector text embedding vision encoder meta query (soft) Put the black objects into the draw. MLLM pitch Frequency (Hz) z x y roll yaw future actions pitch diffusion noise flows of actions [PITH_FULL_IMAGE:figures/full_fi… view at source ↗

**Figure 7.** Figure 7: VLANeXt architecture. Multi-view visual inputs, language instructions, and proprioception are tokenized and processed by a multimodal LLM, with meta queries enabling soft interaction with the policy module. Action chunks are predicted using flow matching and further regularized by a frequency-domain objective. diction (Zhou et al., 2022; Yi et al., 2023; Yang et al., 2024; Wang et al., 2025a), we introduc… view at source ↗

**Figure 8.** Figure 8: Our single-arm and bimanual arm tasks for the real-world experiments. world settings. In addition, even without bimanual training, our method can adapt to bimanual robotics tasks with decent performance, demonstrating the cross-embodiment adaptability of the method. Additional video demonstrations of our experimental results are provided in the supplementary materials. 5. Conclusion This work moves toward… view at source ↗

**Figure 9.** Figure 9: Qualitative experiments of our method in real-world tasks. (a) Spatial (pick up the black bowl between the plate and the ramekin and place it on the plate) (b) Object (pick up the tomato sauce and place it in the basket) (c) Goal (open the middle drawer of the cabinet) (d) Long (put both the alphabet soup and the tomato sauce in the basket) [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative experiments of our method in the four suites of the LIBERO benchmark. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative experiments of our method in the 7 types of perturbations in the same task in the LIBERO-plus benchmark. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

read the original abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper delivers twelve concrete findings from controlled ablations in a single VLA setup, yielding VLANeXt with reported gains on LIBERO, but the isolation of those findings rests on an assumption of separable design dimensions that may not fully hold.

read the letter

The main point is that the authors take the messy VLA literature, fix the training and eval protocol, and run a structured sweep across foundational components, perception essentials, and action modeling. From that they extract twelve findings and assemble VLANeXt, which they say beats prior work on LIBERO and LIBERO-plus while also showing some real-robot results. They release the code, which is the most immediately useful part for anyone who wants to try the recipes themselves.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic exploration of the Vision-Language-Action (VLA) design space under a single unified training and evaluation protocol. Starting from an RT-2-like baseline, the authors ablate choices across three dimensions—foundational components, perception essentials, and action modelling—distilling 12 findings that are combined into the VLANeXt model. The resulting model is reported to outperform prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks while also showing strong real-world performance; the authors release a codebase to support reproducibility.

Significance. If the ablations successfully isolate the contribution of each design choice, the work could provide a useful empirical recipe for the VLA community and reduce fragmentation in the literature. The release of a unified codebase is a concrete positive for reproducibility and future extensions. The significance is tempered by the need to confirm that reported gains are not inflated by untested cross-dimensional interactions or residual implementation biases.

major comments (2)

[§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.
[Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.

minor comments (2)

[§3.2] §3.2 (Perception Essentials): The notation for the visual encoder variants is introduced without an explicit summary table mapping each variant to its architectural differences; adding such a table would improve readability.
[Figure 5] Figure 5 (Real-world Rollouts): The caption does not specify the number of trials or success criteria used for the reported success rates; this detail should be added for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental design and commit to specific revisions that strengthen the claims regarding the validity of the distilled recipe and the fairness of baseline comparisons.

read point-by-point responses

Referee: [§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.

Authors: We agree that the current ablations vary one dimension at a time while holding the others at baseline values, which isolates marginal contributions but does not explicitly test for cross-dimensional interactions. This sequential approach is standard for distilling independent design insights, yet we recognize that unexamined synergies could affect the separability claim. In the revised manuscript we will add a dedicated cross-term ablation subsection that combines representative changes from the perception and action-modeling axes (e.g., best perception module paired with best action head) and report the resulting performance deltas relative to the additive expectation. revision: yes
Referee: [Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.

Authors: All reported baselines were re-implemented from the original papers and trained from scratch inside the same unified training and evaluation harness, using identical optimizer settings, data splits, batch sizes, and random seeds. To make this explicit, we will expand the experimental-setup section and add an appendix table that lists, for each baseline, the source code reference, exact hyperparameter values used, and confirmation that training was performed under the VLANeXt protocol rather than relying on previously published numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical VLA ablation study

full rationale

The paper's chain begins with an RT-2-like baseline and proceeds via systematic empirical ablations across three design dimensions in a unified framework, distilling 12 findings that are then assembled into VLANeXt. Performance is measured on external benchmarks (LIBERO, LIBERO-plus) and real-world tests rather than being derived from fitted parameters or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing way that would make outputs equivalent to inputs by construction. The unified protocol and codebase release further support independent reproducibility outside the paper's own runs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical ablation study; it introduces no new mathematical axioms, free parameters fitted to target results, or invented physical entities. It relies on standard assumptions that benchmark tasks measure policy quality and that controlled variation isolates design effects.

axioms (1)

domain assumption Design choices along foundational components, perception essentials, and action modelling can be varied independently inside one shared training and evaluation protocol.
Invoked when the paper states it systematically dissects the design space under a unified framework.

pith-pipeline@v0.9.0 · 5790 in / 1247 out tokens · 62111 ms · 2026-05-21T12:53:02.803599+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 24 internal anchors

[1]

Qwen3-VL Technical Report

Bai, S., Cai, Y ., Chen, R., Chen, K., et al. Qwen3-vl techni- cal report.arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Gao, C., and Shou, M. Z. Evolve-vla: Test-time training from e...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks

Bhat, V ., Lan, Y .-H., Krishnamurthy, P., Karri, R., and Khor- rami, F. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800,

work page arXiv
[3]

Motus: A Unified Latent Action World Model

Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y ., Xiang, C., Rong, Y ., et al. Mo- tus: A unified latent action world model.arXiv preprint arXiv:2512.13030,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Bjorck, J., Casta˜neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a

Cen, J., Huang, S., Yuan, Y ., Li, K., Yuan, H., Yu, C., Jiang, Y ., Guo, J., Li, X., Luo, H., et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a. Cen, J., Yu, C., Yuan, H., Jiang, Y ., Huang, S., Guo, J., Li, X., Song, Y ., Luo, H., Wang, F., et al. Worldvla: To- wards autoregressive action world m...

work page arXiv
[7]

Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a

Chen, P., Bu, P., Wang, Y ., Wang, X., Wang, Z., Guo, J., Zhao, Y ., Zhu, Q., Song, J., Yang, S., et al. Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a. Chen, Z., Niu, R., Kong, H., Wang, Q., Xing, Q., and Fan, Z. Tgrpo: Fine-tuning vision-language-action model v...

work page arXiv
[8]

Cui, Y ., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y ., Luo, Z., Wang, J., Wang, W., et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

Ding, P., Ma, J., Tong, X., Zou, B., Luo, X., Fan, Y ., Wang, T., Lu, H., Mo, P., Liu, J., et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795,

work page arXiv
[10]

Srpo: Self-referential policy optimization for vision-language-action models.arXiv preprint arXiv:2511.15605, 2025

Fei, S., Wang, S., Ji, L., Li, A., Zhang, S., Liu, L., Hou, J., Gong, J., Zhao, X., and Qiu, X. Srpo: Self-referential 9 VLANeXt: Recipes for Building Strong VLA Models policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025a. Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., e...

work page arXiv
[11]

Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

Goyal, A., Hadfield, H., Yang, X., Blukis, V ., and Ramos, F. Vla-0: Building state-of-the-art vlas with zero modifica- tion.arXiv preprint arXiv:2510.13054,

work page arXiv
[12]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search

Guo, W., Lu, G., Deng, H., Wu, Z., Tang, Y ., and Wang, Z. Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643,

work page arXiv
[14]

Training Large Language Models to Reason in a Continuous Latent Space

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Huang, C.-P., Wu, Y .-H., Chen, M.-H., Wang, Y .-C. F., and Yang, F.-E. Thinkact: Vision-language-action reason- ing via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025a. Huang, J., Wang, S., Lin, F., Hu, Y ., Wen, C., and Gao, Y . Tactile-vla: unlocking vision-language-action model’s physical knowledge for tactile generalization.a...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Huang, W., Wang, C., Li, Y ., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. InCoRL, 2025c. Hung, C.-Y ., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al. Nora: A small open- sourced generalist vision language action model for em- bodied tasks.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dha- balia, K., DiCarlo, J., et al. pi0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025a. Intelligence, P., Black, K., Brown, N., Darpinian, J., Dha- balia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., and Nair, S. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

work page arXiv
[19]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model. InCoRL, 2024a. Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b

Kim, Y ., Oh, H., Lee, J., Choi, J., Ji, G., Jung, M., Youm, D., and Hwangbo, J. Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b. Kuang, F., You, J., Hu, Y ., Zhang, T., Wen, C., and Gao, Y . Adapt your body: Mitigating proprioception shifts in imitation learning.arXiv preprint arXiv:2506.23944,

work page arXiv
[21]

MolmoAct: Action Reasoning Models that can Reason in Space

10 VLANeXt: Recipes for Building Strong VLA Models Lee, J., Duan, J., Fang, H., Deng, Y ., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y . R., Lee, S., et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a

Li, H., Yang, S., Chen, Y ., Tian, Y ., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a. Li, H., Zuo, Y ., Yu, J., Zhang, Y ., Yang, Z., Zhang, K., Zhu, X., Zhang, Y ., Chen, T., Cui, G., et al. Simplevla-rl...

work page arXiv
[23]

Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

Liang, H., Chen, X., Wang, B., Chen, M., Liu, Y ., Zhang, Y ., Chen, Z., Yang, T., Chen, Y ., Pang, J., et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

work page arXiv
[24]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Lu, G., Guo, W., Zhang, C., Zhou, Y ., Jiang, H., Gao, Z., Tang, Y ., and Wang, Z. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Lv, Q., Kong, W., Li, H., Zeng, J., Qiu, Z., Qu, D., Song, H., Chen, Q., Deng, X., and Pang, J. F1: A vision-language- action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

A Survey on Vision-Language-Action Models for Embodied AI

Ma, Y ., Song, Z., Zhuang, Y ., Hao, J., and King, I. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Transfer between Modalities with MetaQueries

Pan, X., Shukla, S. N., Singh, A., Zhao, Z., Mishra, S. K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., Hou, J., and Xie, S. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Qu, D., Song, H., Chen, Q., Yao, Y ., Ye, X., Ding, Y ., Wang, Z., Gu, J., Zhao, B., Wang, D., et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

E., Otto, F., and Lioutikov, R

Reuss, M., Zhou, H., R ¨uhle, M., Ya ˘gmurlu, ¨O. E., Otto, F., and Lioutikov, R. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996,

work page arXiv
[31]

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

11 VLANeXt: Recipes for Building Strong VLA Models Shi, H., Xie, B., Liu, Y ., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., and Huang, G. Memo- ryvla: Perceptual-cognitive memory in vision-language- action models for robotic manipulation.arXiv preprint arXiv:2508.19236,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Gemini Robotics: Bringing AI into the Physical World

Team, G. R., Abeyruwan, S., Ainslie, J., Alayrac, J.-B., Arenas, M. G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic under- standing, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

End-to-end Listen, Look, Speak and Act

Wang, S., Yu, W., Chen, X., Tian, X., Zhang, J., Lu, L., and Zhang, C. End-to-end listen, look, speak and act.arXiv preprint arXiv:2510.16756, 2025b. Wang, Y ., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509....

work page internal anchor Pith review Pith/arXiv arXiv
[36]

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Xiao, J., Yang, Y ., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.-S., and Zhang, Q. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025a. Xiao, L., Li, J., Gao, J., Ye, F., Jin, Y ., Qian, J., Zhang, J., Wu, Y ., and Yu, X. Ava-vla: Improving vision-language- action models with activ...

work page internal anchor Pith review Pith/arXiv arXiv
[37]

4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a

Zhang, J., Chen, Y ., Xu, Y ., Huang, Z., Zhou, Y ., Yuan, Y .-J., Cai, X., Huang, G., Quan, X., Xu, H., et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a. 12 VLANeXt: Recipes for Building Strong VLA Models Zhang, J., Guo, Y ., Hu, Y ., Chen, X., Zhu, X., and Chen, J. Up-vl...

work page arXiv
[38]

Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge

Zhang, W., Liu, H., Qi, Z., Wang, Y ., Yu, X., Zhang, J., Dong, R., He, J., Lu, F., Wang, H., et al. Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge. InNeurIPS, 2025c. Zhang, Z., Zheng, K., Chen, Z., Jang, J., Li, Y ., Han, S., Wang, C., Ding, M., Fox, D., and Yao, H. Grape: Gener- alizing robot policy via preference ...

work page arXiv
[39]

Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

Zhong, Z., Yan, H., Li, J., Liu, X., Gong, X., Zhang, T., Song, W., Chen, J., Zheng, X., Wang, H., et al. Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

work page arXiv
[40]

More Experimental Results A.1

13 VLANeXt: Recipes for Building Strong VLA Models A. More Experimental Results A.1. Qualitative Experiments We present more demos of our model on the LIBERO and LIBERO-plus benchmarks, as well as in real-world settings (see Figures 10, 11, and 9). Additional video demonstrations of our experimental results are provided in the supplementary materials. (a)...

work page 2020
[41]

primordial soup

to enhance action generation, and designing post-training optimization like planning or reinforcement learning to adapt to specific environment (Guo et al., 2025; Zhang et al., 2025d; Bai et al., 2025c; Tan et al., 2025; Li et al., 2025b; Fei et al., 2025a; Chen et al., 2025b; Huang et al., 2025a; Lu et al., 2025; Xiao et al., 2025a;b). Additionally, a su...

work page 2025

[1] [1]

Qwen3-VL Technical Report

Bai, S., Cai, Y ., Chen, R., Chen, K., et al. Qwen3-vl techni- cal report.arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Gao, C., and Shou, M. Z. Evolve-vla: Test-time training from e...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks

Bhat, V ., Lan, Y .-H., Krishnamurthy, P., Karri, R., and Khor- rami, F. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800,

work page arXiv

[3] [3]

Motus: A Unified Latent Action World Model

Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y ., Xiang, C., Rong, Y ., et al. Mo- tus: A unified latent action world model.arXiv preprint arXiv:2512.13030,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Bjorck, J., Casta˜neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a

Cen, J., Huang, S., Yuan, Y ., Li, K., Yuan, H., Yu, C., Jiang, Y ., Guo, J., Li, X., Luo, H., et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a. Cen, J., Yu, C., Yuan, H., Jiang, Y ., Huang, S., Guo, J., Li, X., Song, Y ., Luo, H., Wang, F., et al. Worldvla: To- wards autoregressive action world m...

work page arXiv

[7] [7]

Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a

Chen, P., Bu, P., Wang, Y ., Wang, X., Wang, Z., Guo, J., Zhao, Y ., Zhu, Q., Song, J., Yang, S., et al. Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a. Chen, Z., Niu, R., Kong, H., Wang, Q., Xing, Q., and Fan, Z. Tgrpo: Fine-tuning vision-language-action model v...

work page arXiv

[8] [8]

Cui, Y ., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y ., Luo, Z., Wang, J., Wang, W., et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025

Ding, P., Ma, J., Tong, X., Zou, B., Luo, X., Fan, Y ., Wang, T., Lu, H., Mo, P., Liu, J., et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795,

work page arXiv

[10] [10]

Srpo: Self-referential policy optimization for vision-language-action models.arXiv preprint arXiv:2511.15605, 2025

Fei, S., Wang, S., Ji, L., Li, A., Zhang, S., Liu, L., Hou, J., Gong, J., Zhao, X., and Qiu, X. Srpo: Self-referential 9 VLANeXt: Recipes for Building Strong VLA Models policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025a. Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., e...

work page arXiv

[11] [11]

Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

Goyal, A., Hadfield, H., Yang, X., Blukis, V ., and Ramos, F. Vla-0: Building state-of-the-art vlas with zero modifica- tion.arXiv preprint arXiv:2510.13054,

work page arXiv

[12] [12]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search

Guo, W., Lu, G., Deng, H., Wu, Z., Tang, Y ., and Wang, Z. Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643,

work page arXiv

[14] [14]

Training Large Language Models to Reason in a Continuous Latent Space

Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Huang, C.-P., Wu, Y .-H., Chen, M.-H., Wang, Y .-C. F., and Yang, F.-E. Thinkact: Vision-language-action reason- ing via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025a. Huang, J., Wang, S., Lin, F., Hu, Y ., Wen, C., and Gao, Y . Tactile-vla: unlocking vision-language-action model’s physical knowledge for tactile generalization.a...

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Huang, W., Wang, C., Li, Y ., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. InCoRL, 2025c. Hung, C.-Y ., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al. Nora: A small open- sourced generalist vision language action model for em- bodied tasks.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dha- balia, K., DiCarlo, J., et al. pi0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025a. Intelligence, P., Black, K., Brown, N., Darpinian, J., Dha- balia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al...

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., and Nair, S. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,

work page arXiv

[19] [19]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model. InCoRL, 2024a. Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b

Kim, Y ., Oh, H., Lee, J., Choi, J., Ji, G., Jung, M., Youm, D., and Hwangbo, J. Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b. Kuang, F., You, J., Hu, Y ., Zhang, T., Wen, C., and Gao, Y . Adapt your body: Mitigating proprioception shifts in imitation learning.arXiv preprint arXiv:2506.23944,

work page arXiv

[21] [21]

MolmoAct: Action Reasoning Models that can Reason in Space

10 VLANeXt: Recipes for Building Strong VLA Models Lee, J., Duan, J., Fang, H., Deng, Y ., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y . R., Lee, S., et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a

Li, H., Yang, S., Chen, Y ., Tian, Y ., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a. Li, H., Zuo, Y ., Yu, J., Zhang, Y ., Yang, Z., Zhang, K., Zhu, X., Zhang, Y ., Chen, T., Cui, G., et al. Simplevla-rl...

work page arXiv

[23] [23]

Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

Liang, H., Chen, X., Wang, B., Chen, M., Liu, Y ., Zhang, Y ., Chen, Z., Yang, T., Chen, Y ., Pang, J., et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,

work page arXiv

[24] [24]

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

Lu, G., Guo, W., Zhang, C., Zhou, Y ., Jiang, H., Gao, Z., Tang, Y ., and Wang, Z. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Lv, Q., Kong, W., Li, H., Zeng, J., Qiu, Z., Qu, D., Song, H., Chen, Q., Deng, X., and Pang, J. F1: A vision-language- action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

A Survey on Vision-Language-Action Models for Embodied AI

Ma, Y ., Song, Z., Zhuang, Y ., Hao, J., and King, I. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Transfer between Modalities with MetaQueries

Pan, X., Shukla, S. N., Singh, A., Zhao, Z., Mishra, S. K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., Hou, J., and Xie, S. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Qu, D., Song, H., Chen, Q., Yao, Y ., Ye, X., Ding, Y ., Wang, Z., Gu, J., Zhao, B., Wang, D., et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

E., Otto, F., and Lioutikov, R

Reuss, M., Zhou, H., R ¨uhle, M., Ya ˘gmurlu, ¨O. E., Otto, F., and Lioutikov, R. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996,

work page arXiv

[31] [31]

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

11 VLANeXt: Recipes for Building Strong VLA Models Shi, H., Xie, B., Liu, Y ., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., and Huang, G. Memo- ryvla: Perceptual-cognitive memory in vision-language- action models for robotic manipulation.arXiv preprint arXiv:2508.19236,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Gemini Robotics: Bringing AI into the Physical World

Team, G. R., Abeyruwan, S., Ainslie, J., Alayrac, J.-B., Arenas, M. G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic under- standing, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

End-to-end Listen, Look, Speak and Act

Wang, S., Yu, W., Chen, X., Tian, X., Zhang, J., Lu, L., and Zhang, C. End-to-end listen, look, speak and act.arXiv preprint arXiv:2510.16756, 2025b. Wang, Y ., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509....

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Xiao, J., Yang, Y ., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.-S., and Zhang, Q. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025a. Xiao, L., Li, J., Gao, J., Ye, F., Jin, Y ., Qian, J., Zhang, J., Wu, Y ., and Yu, X. Ava-vla: Improving vision-language- action models with activ...

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a

Zhang, J., Chen, Y ., Xu, Y ., Huang, Z., Zhou, Y ., Yuan, Y .-J., Cai, X., Huang, G., Quan, X., Xu, H., et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a. 12 VLANeXt: Recipes for Building Strong VLA Models Zhang, J., Guo, Y ., Hu, Y ., Chen, X., Zhu, X., and Chen, J. Up-vl...

work page arXiv

[38] [38]

Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge

Zhang, W., Liu, H., Qi, Z., Wang, Y ., Yu, X., Zhang, J., Dong, R., He, J., Lu, F., Wang, H., et al. Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge. InNeurIPS, 2025c. Zhang, Z., Zheng, K., Chen, Z., Jang, J., Li, Y ., Han, S., Wang, C., Ding, M., Fox, D., and Yao, H. Grape: Gener- alizing robot policy via preference ...

work page arXiv

[39] [39]

Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

Zhong, Z., Yan, H., Li, J., Liu, X., Gong, X., Zhang, T., Song, W., Chen, J., Zheng, X., Wang, H., et al. Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,

work page arXiv

[40] [40]

More Experimental Results A.1

13 VLANeXt: Recipes for Building Strong VLA Models A. More Experimental Results A.1. Qualitative Experiments We present more demos of our model on the LIBERO and LIBERO-plus benchmarks, as well as in real-world settings (see Figures 10, 11, and 9). Additional video demonstrations of our experimental results are provided in the supplementary materials. (a)...

work page 2020

[41] [41]

primordial soup

to enhance action generation, and designing post-training optimization like planning or reinforcement learning to adapt to specific environment (Guo et al., 2025; Zhang et al., 2025d; Bai et al., 2025c; Tan et al., 2025; Li et al., 2025b; Fei et al., 2025a; Chen et al., 2025b; Huang et al., 2025a; Lu et al., 2025; Xiao et al., 2025a;b). Additionally, a su...

work page 2025