VLANeXt: Recipes for Building Strong VLA Models
Pith reviewed 2026-05-21 12:53 UTC · model grok-4.3
The pith
A controlled sweep of design choices in vision-language-action models yields a recipe that produces a stronger model called VLANeXt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under one shared training protocol and evaluation setup, systematic variation of foundational components, perception essentials, and action modelling perspectives produces twelve concrete findings. When these findings are combined, the resulting VLANeXt model surpasses prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and also succeeds in real-world robot experiments.
What carries the argument
The three-dimensional dissection of the VLA design space (foundational components, perception essentials, action modelling perspectives) performed inside a single fixed training and evaluation framework.
Load-bearing premise
That holding training protocols and test conditions constant is enough to reveal the true contribution of each design choice without leftover biases from implementation details or untested interactions.
What would settle it
Retraining VLANeXt from the released code on the same benchmarks but with a different random seed or data split and finding that it no longer outperforms the previous best models would challenge the claim that the twelve findings form a robust recipe.
Figures
read the original abstract
Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic exploration of the Vision-Language-Action (VLA) design space under a single unified training and evaluation protocol. Starting from an RT-2-like baseline, the authors ablate choices across three dimensions—foundational components, perception essentials, and action modelling—distilling 12 findings that are combined into the VLANeXt model. The resulting model is reported to outperform prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks while also showing strong real-world performance; the authors release a codebase to support reproducibility.
Significance. If the ablations successfully isolate the contribution of each design choice, the work could provide a useful empirical recipe for the VLA community and reduce fragmentation in the literature. The release of a unified codebase is a concrete positive for reproducibility and future extensions. The significance is tempered by the need to confirm that reported gains are not inflated by untested cross-dimensional interactions or residual implementation biases.
major comments (2)
- [§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.
- [Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.
minor comments (2)
- [§3.2] §3.2 (Perception Essentials): The notation for the visual encoder variants is introduced without an explicit summary table mapping each variant to its architectural differences; adding such a table would improve readability.
- [Figure 5] Figure 5 (Real-world Rollouts): The caption does not specify the number of trials or success criteria used for the reported success rates; this detail should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental design and commit to specific revisions that strengthen the claims regarding the validity of the distilled recipe and the fairness of baseline comparisons.
read point-by-point responses
-
Referee: [§4.3] §4.3 (Ablation Studies across Dimensions): The 12 findings are derived from separate ablations along the three axes, yet no explicit cross-term experiments (e.g., perception change combined with action-head change) are reported. Without such tests, it remains possible that observed gains arise from synergistic interactions rather than additive, independent contributions, which directly affects the validity of distilling a separable 'recipe'.
Authors: We agree that the current ablations vary one dimension at a time while holding the others at baseline values, which isolates marginal contributions but does not explicitly test for cross-dimensional interactions. This sequential approach is standard for distilling independent design insights, yet we recognize that unexamined synergies could affect the separability claim. In the revised manuscript we will add a dedicated cross-term ablation subsection that combines representative changes from the perception and action-modeling axes (e.g., best perception module paired with best action head) and report the resulting performance deltas relative to the additive expectation. revision: yes
-
Referee: [Table 4] Table 4 (LIBERO and LIBERO-plus Results): The performance tables compare VLANeXt against baselines under the claimed unified protocol, but the manuscript does not provide sufficient detail on whether every baseline was re-implemented and trained from scratch with identical hyperparameters, data splits, and random seeds. This information is load-bearing for attributing gains specifically to the 12 findings rather than hidden implementation advantages.
Authors: All reported baselines were re-implemented from the original papers and trained from scratch inside the same unified training and evaluation harness, using identical optimizer settings, data splits, batch sizes, and random seeds. To make this explicit, we will expand the experimental-setup section and add an appendix table that lists, for each baseline, the source code reference, exact hyperparameter values used, and confirmation that training was performed under the VLANeXt protocol rather than relying on previously published numbers. revision: yes
Circularity Check
No significant circularity in empirical VLA ablation study
full rationale
The paper's chain begins with an RT-2-like baseline and proceeds via systematic empirical ablations across three design dimensions in a unified framework, distilling 12 findings that are then assembled into VLANeXt. Performance is measured on external benchmarks (LIBERO, LIBERO-plus) and real-world tests rather than being derived from fitted parameters or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing way that would make outputs equivalent to inputs by construction. The unified protocol and codebase release further support independent reproducibility outside the paper's own runs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Design choices along foundational components, perception essentials, and action modelling can be varied independently inside one shared training and evaluation protocol.
Forward citations
Cited by 3 Pith papers
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
Reference graph
Works this paper leans on
-
[1]
Bai, S., Cai, Y ., Chen, R., Chen, K., et al. Qwen3-vl techni- cal report.arXiv preprint arXiv:2511.21631, 2025a. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025b. Bai, Z., Gao, C., and Shou, M. Z. Evolve-vla: Test-time training from e...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Bhat, V ., Lan, Y .-H., Krishnamurthy, P., Karri, R., and Khor- rami, F. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800,
-
[3]
Motus: A Unified Latent Action World Model
Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y ., Xiang, C., Rong, Y ., et al. Mo- tus: A unified latent action world model.arXiv preprint arXiv:2512.13030,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Bjorck, J., Casta˜neda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y ., Fox, D., Hu, F., Huang, S., et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a
Cen, J., Huang, S., Yuan, Y ., Li, K., Yuan, H., Yu, C., Jiang, Y ., Guo, J., Li, X., Luo, H., et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a. Cen, J., Yu, C., Yuan, H., Jiang, Y ., Huang, S., Guo, J., Li, X., Song, Y ., Luo, H., Wang, F., et al. Worldvla: To- wards autoregressive action world m...
-
[7]
Chen, P., Bu, P., Wang, Y ., Wang, X., Wang, Z., Guo, J., Zhao, Y ., Zhu, Q., Song, J., Yang, S., et al. Combatvla: An efficient vision-language-action model for combat tasks in 3d action role-playing games.arXiv preprint arXiv:2503.09527, 2025a. Chen, Z., Niu, R., Kong, H., Wang, Q., Xing, Q., and Fan, Z. Tgrpo: Fine-tuning vision-language-action model v...
-
[8]
Cui, Y ., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y ., Luo, Z., Wang, J., Wang, W., et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Ding, P., Ma, J., Tong, X., Zou, B., Luo, X., Fan, Y ., Wang, T., Lu, H., Mo, P., Liu, J., et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795,
-
[10]
Fei, S., Wang, S., Ji, L., Li, A., Zhang, S., Liu, L., Hou, J., Gong, J., Zhao, X., and Qiu, X. Srpo: Self-referential 9 VLANeXt: Recipes for Building Strong VLA Models policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025a. Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., e...
-
[11]
Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025
Goyal, A., Hadfield, H., Yang, X., Blukis, V ., and Ramos, F. Vla-0: Building state-of-the-art vlas with zero modifica- tion.arXiv preprint arXiv:2510.13054,
-
[12]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Guo, W., Lu, G., Deng, H., Wu, Z., Tang, Y ., and Wang, Z. Vla-reasoner: Empowering vision-language-action models with reasoning via online monte carlo tree search. arXiv preprint arXiv:2509.22643,
-
[14]
Training Large Language Models to Reason in a Continuous Latent Space
Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y . Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Huang, C.-P., Wu, Y .-H., Chen, M.-H., Wang, Y .-C. F., and Yang, F.-E. Thinkact: Vision-language-action reason- ing via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025a. Huang, J., Wang, S., Lin, F., Hu, Y ., Wen, C., and Gao, Y . Tactile-vla: unlocking vision-language-action model’s physical knowledge for tactile generalization.a...
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
Huang, W., Wang, C., Li, Y ., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. InCoRL, 2025c. Hung, C.-Y ., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al. Nora: A small open- sourced generalist vision language action model for em- bodied tasks.arXiv pre...
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dha- balia, K., DiCarlo, J., et al. pi0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025a. Intelligence, P., Black, K., Brown, N., Darpinian, J., Dha- balia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al...
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., and Nair, S. Emergence of human to robot transfer in vision-language-action models.arXiv preprint arXiv:2512.22414,
-
[19]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., San- keti, P., et al. Openvla: An open-source vision-language- action model. InCoRL, 2024a. Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b
Kim, Y ., Oh, H., Lee, J., Choi, J., Ji, G., Jung, M., Youm, D., and Hwangbo, J. Not only rewards but also constraints: Applications on legged robot locomotion.TRO, 2024b. Kuang, F., You, J., Hu, Y ., Zhang, T., Wen, C., and Gao, Y . Adapt your body: Mitigating proprioception shifts in imitation learning.arXiv preprint arXiv:2506.23944,
-
[21]
MolmoAct: Action Reasoning Models that can Reason in Space
10 VLANeXt: Recipes for Building Strong VLA Models Lee, J., Duan, J., Fang, H., Deng, Y ., Liu, S., Li, B., Fang, B., Zhang, J., Wang, Y . R., Lee, S., et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Li, H., Yang, S., Chen, Y ., Tian, Y ., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation.arXiv preprint arXiv:2506.19816, 2025a. Li, H., Zuo, Y ., Yu, J., Zhang, Y ., Yang, Z., Zhang, K., Zhu, X., Zhang, Y ., Chen, T., Cui, G., et al. Simplevla-rl...
-
[23]
Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,
Liang, H., Chen, X., Wang, B., Chen, M., Liu, Y ., Zhang, Y ., Chen, Z., Yang, T., Chen, Y ., Pang, J., et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975,
-
[24]
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Lu, G., Guo, W., Zhang, C., Zhou, Y ., Jiang, H., Gao, Z., Tang, Y ., and Wang, Z. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Lv, Q., Kong, W., Li, H., Zeng, J., Qiu, Z., Qu, D., Song, H., Chen, Q., Deng, X., and Pang, J. F1: A vision-language- action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
A Survey on Vision-Language-Action Models for Embodied AI
Ma, Y ., Song, Z., Zhuang, Y ., Hao, J., and King, I. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Transfer between Modalities with MetaQueries
Pan, X., Shukla, S. N., Singh, A., Zhao, Z., Mishra, S. K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., Hou, J., and Xie, S. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Qu, D., Song, H., Chen, Q., Yao, Y ., Ye, X., Ding, Y ., Wang, Z., Gu, J., Zhao, B., Wang, D., et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
E., Otto, F., and Lioutikov, R
Reuss, M., Zhou, H., R ¨uhle, M., Ya ˘gmurlu, ¨O. E., Otto, F., and Lioutikov, R. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996,
-
[31]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
11 VLANeXt: Recipes for Building Strong VLA Models Shi, H., Xie, B., Liu, Y ., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., and Huang, G. Memo- ryvla: Perceptual-cognitive memory in vision-language- action models for robotic manipulation.arXiv preprint arXiv:2508.19236,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al. Smolvla: A vision-language- action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Gemini Robotics: Bringing AI into the Physical World
Team, G. R., Abeyruwan, S., Ainslie, J., Alayrac, J.-B., Arenas, M. G., Armstrong, T., Balakrishna, A., Baruch, R., Bauza, M., Blokzijl, M., et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., et al. Siglip 2: Multilingual vision-language encoders with improved semantic under- standing, localization, and dense features.arXiv preprint arXiv:2502.14786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
End-to-end Listen, Look, Speak and Act
Wang, S., Yu, W., Chen, X., Tian, X., Zhang, J., Lu, L., and Zhang, C. End-to-end listen, look, speak and act.arXiv preprint arXiv:2510.16756, 2025b. Wang, Y ., Ding, P., Li, L., Cui, C., Ge, Z., Tong, X., Song, W., Zhao, H., Zhao, W., Hou, P., et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.arXiv preprint arXiv:2509....
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training
Xiao, J., Yang, Y ., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.-S., and Zhang, Q. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025a. Xiao, L., Li, J., Gao, J., Ye, F., Jin, Y ., Qian, J., Zhang, J., Wu, Y ., and Yu, X. Ava-vla: Improving vision-language- action models with activ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Zhang, J., Chen, Y ., Xu, Y ., Huang, Z., Zhou, Y ., Yuan, Y .-J., Cai, X., Huang, G., Quan, X., Xu, H., et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.arXiv preprint arXiv:2506.22242, 2025a. 12 VLANeXt: Recipes for Building Strong VLA Models Zhang, J., Guo, Y ., Hu, Y ., Chen, X., Zhu, X., and Chen, J. Up-vl...
-
[38]
Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge
Zhang, W., Liu, H., Qi, Z., Wang, Y ., Yu, X., Zhang, J., Dong, R., He, J., Lu, F., Wang, H., et al. Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge. InNeurIPS, 2025c. Zhang, Z., Zheng, K., Chen, Z., Jang, J., Li, Y ., Han, S., Wang, C., Ding, M., Fox, D., and Yao, H. Grape: Gener- alizing robot policy via preference ...
-
[39]
Zhong, Z., Yan, H., Li, J., Liu, X., Gong, X., Zhang, T., Song, W., Chen, J., Zheng, X., Wang, H., et al. Flowvla: Visual chain of thought-based motion reason- ing for vision-language-action models.arXiv preprint arXiv:2508.18269,
-
[40]
13 VLANeXt: Recipes for Building Strong VLA Models A. More Experimental Results A.1. Qualitative Experiments We present more demos of our model on the LIBERO and LIBERO-plus benchmarks, as well as in real-world settings (see Figures 10, 11, and 9). Additional video demonstrations of our experimental results are provided in the supplementary materials. (a)...
work page 2020
-
[41]
to enhance action generation, and designing post-training optimization like planning or reinforcement learning to adapt to specific environment (Guo et al., 2025; Zhang et al., 2025d; Bai et al., 2025c; Tan et al., 2025; Li et al., 2025b; Fei et al., 2025a; Chen et al., 2025b; Huang et al., 2025a; Lu et al., 2025; Xiao et al., 2025a;b). Additionally, a su...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.