pith. machine review for the scientific record.

arxiv: 2605.06175 · v2 · submitted 2026-05-07 · 💻 cs.RO

Recognition: no theorem link

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts


Pith reviewed 2026-05-11 01:01 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · parameter-efficient fine-tuning · generalized experts · specialized experts · spectral decomposition · robotic control · LIBERO benchmark

The pith

Spectrally decomposing vision-language backbones into generalized and specialized experts enables VLA models to adapt to robotic control with few parameters while retaining pre-trained knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models start with strong visual and language priors from pre-training, yet adapting them to robot tasks often leads to forgetting those priors or to poor control performance. Full fine-tuning overfits on limited robot data, while standard parameter-efficient methods adapt too weakly. VLA-GSE addresses this by decomposing the frozen backbone via singular value decomposition, routing the leading singular components to shared generalized experts and disjoint residual components to task-specific specialized experts. This split increases adaptation capacity without raising the trainable parameter count. A sympathetic reader would care because it makes it feasible to deploy capable robots across varied settings with far less computation and storage than retraining from scratch.

Core claim

VLA-GSE is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts and disjoint residual components to specialized experts. This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts.

What carries the argument

Spectral decomposition of the frozen backbone that assigns leading singular components to generalized shared experts and disjoint residual components to specialized routed experts.

Load-bearing premise

The spectral decomposition will reliably separate useful shared knowledge from task-specific adaptations without losing critical information or introducing harmful interference.

What would settle it

A new robotic task where VLA-GSE produces lower success rates than LoRA or full fine-tuning, or where vision-language benchmark scores drop more than with standard PEFT, would show the claim is false.
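The decomposition described above can be made concrete with a short sketch. Everything below is illustrative: the expert count, per-expert ranks, and the exact assignment of residual singular-value segments are assumptions, since the review only states that leading components seed the generalized (shared) experts and disjoint residual components seed the specialized (routed) experts.

```python
import torch

@torch.no_grad()
def init_gse_experts(W0: torch.Tensor, rank_general: int = 64,
                     n_specialized: int = 4, rank_special: int = 16):
    """Illustrative split of a frozen weight W0 (out_dim x in_dim) into
    one generalized expert (leading singular components) and several
    specialized experts (disjoint residual singular-value segments).

    Hypothetical hyperparameters; the paper's actual ranks, expert
    counts, and any rescaling of the frozen residual are not given here.
    Assumes rank_general + n_specialized * rank_special <= min(W0.shape).
    """
    U, S, Vh = torch.linalg.svd(W0, full_matrices=False)

    # Generalized (shared) expert: the top `rank_general` components.
    g = slice(0, rank_general)
    general_A = Vh[g] * S[g].unsqueeze(-1)   # rank_general x in_dim
    general_B = U[:, g].clone()              # out_dim x rank_general

    # Specialized (routed) experts: disjoint segments of the residual spectrum.
    specialized = []
    for i in range(n_specialized):
        s = slice(rank_general + i * rank_special,
                  rank_general + (i + 1) * rank_special)
        specialized.append((Vh[s] * S[s].unsqueeze(-1), U[:, s].clone()))

    # Remaining components stay in an adjusted frozen weight W̃0 so that
    # W0 ≈ W̃0 + general_B @ general_A + Σ_i B_i @ A_i at initialization.
    used = rank_general + n_specialized * rank_special
    W_tilde = (U[:, used:] * S[used:]) @ Vh[used:]
    return W_tilde, (general_A, general_B), specialized
```

Because the pieces sum back to the original weight, adaptation starts from the pre-trained function; if, as in related spectral PEFT methods, only the low-rank expert factors (plus a router) are trained while W̃0 stays frozen, the trainable fraction of parameters remains small.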

Figures

Figures reproduced from arXiv: 2605.06175 by Feifei Gao, Junjie Lu, Kaixin Wang, Li Zhao, Xiaoyu Chen, Xinyao Qin, Yuhua Jiang.

Figure 1. Overview of the VLA-GSE framework. (Top right) VLA-GSE uses an SVD-based adaptive priors scheme to initialize generalized experts and multiple specialized experts based on sorted singular-value segments. (Bottom right) During the forward pass, the output is formed by summing the input transformations induced by the adjusted frozen pre-trained weight W̃0, the generalized expert, and the dynamically selected… (A minimal sketch of this forward pass follows the figure list.)
Figure 2. Attention map visualizations for different models under the LIBERO-Plus…
Figure 3. Real-world generalization settings. The four train-test pairs correspond to variations in lighting, language, clutter layout, and background tablecloth, respectively. For language variations, the training instruction is "Put carrot into bowl." At test time, we replace it with "Place the red vegetable into the container."
Figure 4. Extended attention map visualizations under seven diverse generalization scenarios. As…
Figure 5. Real-world inference latency comparison on an NVIDIA RTX 5080 GPU. We report…
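The Figure 1 caption describes the forward pass as a sum of the adjusted frozen weight's output, the generalized expert's output, and the outputs of dynamically selected specialized experts. A minimal sketch of that summation follows; the linear router and top-k softmax gating are assumptions, as the review does not specify the routing rule.

```python
import torch
import torch.nn.functional as F

def gse_forward(x, W_tilde, general, specialized, router_w, top_k: int = 1):
    """Hedged sketch of the per-layer forward pass suggested by Figure 1:
    y = x W̃0ᵀ + generalized-expert(x) + gated sum of routed specialized experts.
    `router_w` and the top-k softmax gating are illustrative assumptions.
    """
    A_g, B_g = general
    y = x @ W_tilde.T + (x @ A_g.T) @ B_g.T          # frozen path + shared expert

    # Router scores over specialized experts (assumed linear router).
    logits = x @ router_w.T                          # ... x n_specialized
    gates = F.softmax(logits, dim=-1)
    _, top_idx = gates.topk(top_k, dim=-1)

    for i, (A_i, B_i) in enumerate(specialized):
        # Keep only the gates of experts selected by the top-k router.
        mask = (top_idx == i).any(dim=-1).float().unsqueeze(-1)
        y = y + gates[..., i:i+1] * mask * ((x @ A_i.T) @ B_i.T)
    return y
```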
Original abstract

Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pretrained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to specialized experts (routed experts). This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts. Code is available at: https://github.com/YuhuaJiang2002/VLA-GSE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VLA-GSE, a parameter-efficient fine-tuning framework for vision-language-action (VLA) models. It initializes by spectrally decomposing each frozen backbone weight matrix via SVD, routing leading singular components to shared generalized experts and orthogonal residuals to routed specialized experts. Under a comparable parameter budget, the method updates only 2.51% of parameters, reports 81.2% average zero-shot success on LIBERO-Plus, preserves VLM capabilities comparably to LoRA on multimodal benchmarks, and improves real-world manipulation under distribution shifts.

Significance. If the spectral decomposition reliably separates shared priors from task-specific adaptations, the approach could meaningfully advance PEFT for VLAs by improving adaptation capacity without increasing the trainable-parameter budget or sacrificing pre-trained knowledge. The open-source code link is a positive factor for reproducibility.

major comments (2)
  1. [§3.2] §3.2 (SVD Initialization): The central performance claims rest on the assumption that leading singular components preferentially encode pre-trained visual-semantic priors while residuals capture robot-control adaptations. The manuscript provides no analysis of the singular-value spectrum across VLA layers, no comparison of the spectral split against random or non-spectral expert initialization, and no ablation isolating the contribution of the SVD step versus the MoE routing itself; without this, gains may be attributable to routing or expert count rather than the claimed decomposition.
  2. [§4.2] §4.2 and Table 2 (LIBERO-Plus Results): The reported 81.2% average zero-shot success and consistent outperformance over FFT and PEFT baselines are presented without standard deviations across runs, statistical significance tests, or explicit confirmation that all baselines were re-implemented under an identical 2.51% trainable-parameter constraint; this weakens attribution of the improvement specifically to the generalized/specialized expert split.
minor comments (1)
  1. [§3.2] The notation for the number of leading singular components (k) and the routing mechanism for specialized experts is introduced without a clear equation or pseudocode block, making the initialization procedure harder to follow on first reading.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the empirical support for the SVD initialization and to improve statistical reporting. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (SVD Initialization): The central performance claims rest on the assumption that leading singular components preferentially encode pre-trained visual-semantic priors while residuals capture robot-control adaptations. The manuscript provides no analysis of the singular-value spectrum across VLA layers, no comparison of the spectral split against random or non-spectral expert initialization, and no ablation isolating the contribution of the SVD step versus the MoE routing itself; without this, gains may be attributable to routing or expert count rather than the claimed decomposition.

    Authors: We agree that additional analysis would strengthen the attribution of gains to the spectral decomposition. In the revised manuscript we will add: (1) plots and quantitative analysis of singular-value spectra across VLA layers to illustrate the concentration of pre-trained energy in leading components, (2) an ablation that replaces SVD initialization with random initialization while keeping the MoE routing structure, expert count, and parameter budget identical, and (3) results that isolate the SVD step from the routing mechanism. These additions will directly address whether performance improvements derive from the claimed decomposition rather than routing alone. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (LIBERO-Plus Results): The reported 81.2% average zero-shot success and consistent outperformance over FFT and PEFT baselines are presented without standard deviations across runs, statistical significance tests, or explicit confirmation that all baselines were re-implemented under an identical 2.51% trainable-parameter constraint; this weakens attribution of the improvement specifically to the generalized/specialized expert split.

    Authors: We acknowledge the need for greater statistical rigor. In the revised manuscript we will: (1) report standard deviations computed over multiple independent runs for all LIBERO-Plus results, (2) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) against the baselines, and (3) explicitly state in the methods section and Table 2 caption that every baseline was re-implemented under the identical 2.51% trainable-parameter budget. These changes will strengthen the attribution of improvements to the generalized/specialized expert design. revision: yes
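The significance tests named in the second response could look like the following sketch; the success-rate arrays are placeholders standing in for matched per-task (or per-seed) results, not reported numbers.

```python
import numpy as np
from scipy import stats

# Placeholder per-task success rates (%) for matched evaluation conditions;
# real values would come from the multi-seed LIBERO-Plus runs.
vla_gse = np.array([83.0, 79.5, 84.2, 78.1])
lora    = np.array([76.4, 71.0, 80.3, 70.2])

t_stat, p_t = stats.ttest_rel(vla_gse, lora)   # paired t-test
w_stat, p_w = stats.wilcoxon(vla_gse, lora)    # Wilcoxon signed-rank test
print(f"paired t-test p={p_t:.3f}, Wilcoxon p={p_w:.3f}")
```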
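For the singular-value spectrum analysis promised in the first response, a minimal diagnostic might report, per linear layer, how much spectral energy the leading components capture; the rank threshold and the restriction to 2-D weights are assumptions, not the authors' analysis code.

```python
import torch

@torch.no_grad()
def spectral_energy_report(model, rank_general: int = 64):
    """For each 2-D weight matrix, report the fraction of squared singular
    values (spectral energy) captured by the top `rank_general` components.
    A hypothetical diagnostic illustrating the promised per-layer analysis.
    """
    rows = []
    for name, p in model.named_parameters():
        if p.ndim != 2:
            continue
        s = torch.linalg.svdvals(p.float())
        energy = s ** 2
        rows.append((name, float(energy[:rank_general].sum() / energy.sum())))
    return rows  # e.g. [("layers.0.mlp.up_proj.weight", <energy fraction>), ...]
```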

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents VLA-GSE as an empirical PEFT framework whose core step is an SVD-based spectral decomposition of frozen weights to initialize generalized (leading singular) and specialized (residual) experts. This is a design choice, not a derivation. Reported results (81.2% LIBERO-Plus success, 2.51% trainable parameters, outperformance of baselines) are measured outcomes on external benchmarks rather than quantities that reduce to the initialization by construction. No equations, self-citations, or fitted inputs are shown that would make any performance claim tautological. The method remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the method assumes singular value decomposition cleanly separates generalized from specialized knowledge, but no explicit free parameters, axioms, or invented entities are detailed beyond the expert split itself.

pith-pipeline@v0.9.0 · 5569 in / 1156 out tokens · 28035 ms · 2026-05-11T01:01:07.950709+00:00 · methodology

