VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
Pith reviewed 2026-05-11 01:01 UTC · model grok-4.3
The pith
Spectrally decomposing vision-language backbones into generalized and specialized experts enables VLA models to adapt to robotic control with few parameters while retaining pre-trained knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA-GSE is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts and disjoint residual components to specialized experts. This decomposition improves adaptation capacity under a fixed trainable-parameter budget: VLA-GSE updates only 2.51% of the full model parameters yet consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts.
What carries the argument
Spectral decomposition of the frozen backbone that assigns leading singular components to generalized shared experts and disjoint residual components to specialized routed experts.
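As a rough illustration of this spectral split, the following is a minimal NumPy sketch under assumed shapes; `spectral_split`, `k`, and `n_specialized` are illustrative names, not the authors' implementation:

```python
import numpy as np

# Hypothetical sketch of the spectral split described above. A frozen weight
# matrix W is decomposed with SVD; the top-k singular components seed a shared
# "generalized" expert, and disjoint slices of the residual spectrum seed the
# "specialized" experts.
def spectral_split(W, k, n_specialized):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Generalized expert: leading k singular components.
    generalized = (U[:, :k], S[:k], Vt[:k, :])
    # Specialized experts: disjoint slices of the remaining components.
    slices = np.array_split(np.arange(k, len(S)), n_specialized)
    specialized = [(U[:, idx], S[idx], Vt[idx, :]) for idx in slices]
    return generalized, specialized

# Because the slices are disjoint and exhaustive, the experts reconstruct W
# exactly when all components are kept:
W = np.random.randn(16, 12)
(gU, gS, gVt), experts = spectral_split(W, k=4, n_specialized=2)
recon = gU @ np.diag(gS) @ gVt + sum(u @ np.diag(s) @ vt for u, s, vt in experts)
assert np.allclose(recon, W)
```

The exactness of the reconstruction is what makes the premise testable: any gain or loss must come from which components are trained, not from information discarded at initialization.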
Load-bearing premise
The spectral decomposition will reliably separate useful shared knowledge from task-specific adaptations without losing critical information or introducing harmful interference.
What would settle it
A new robotic task where VLA-GSE produces lower success rates than LoRA or full fine-tuning, or where vision-language benchmark scores drop more than with standard PEFT, would show the claim is false.
read the original abstract
Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pretrained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to specialized experts (routed experts). This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts. Code is available at: https://github.com/YuhuaJiang2002/VLA-GSE
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VLA-GSE, a parameter-efficient fine-tuning framework for vision-language-action (VLA) models. It initializes by spectrally decomposing each frozen backbone weight matrix via SVD, routing leading singular components to shared generalized experts and orthogonal residuals to routed specialized experts. Under a comparable parameter budget, the method updates only 2.51% of parameters, reports 81.2% average zero-shot success on LIBERO-Plus, preserves VLM capabilities comparably to LoRA on multimodal benchmarks, and improves real-world manipulation under distribution shifts.
Significance. If the spectral decomposition reliably separates shared priors from task-specific adaptations, the approach could meaningfully advance PEFT for VLAs by improving adaptation capacity without increasing the trainable-parameter budget or sacrificing pre-trained knowledge. The open-source code link is a positive factor for reproducibility.
major comments (2)
- [§3.2] §3.2 (SVD Initialization): The central performance claims rest on the assumption that leading singular components preferentially encode pre-trained visual-semantic priors while residuals capture robot-control adaptations. The manuscript provides no analysis of the singular-value spectrum across VLA layers, no comparison of the spectral split against random or non-spectral expert initialization, and no ablation isolating the contribution of the SVD step versus the MoE routing itself; without this, gains may be attributable to routing or expert count rather than the claimed decomposition.
- [§4.2] §4.2 and Table 2 (LIBERO-Plus Results): The reported 81.2% average zero-shot success and consistent outperformance over FFT and PEFT baselines are presented without standard deviations across runs, statistical significance tests, or explicit confirmation that all baselines were re-implemented under an identical 2.51% trainable-parameter constraint; this weakens attribution of the improvement specifically to the generalized/specialized expert split.
minor comments (1)
- [§3.2] The notation for the number of leading singular components (k) and the routing mechanism for specialized experts is introduced without a clear equation or pseudocode block, making the initialization procedure harder to follow on first reading.
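For concreteness, the kind of pseudocode this comment asks for might look like the sketch below. It is hedged: `moe_forward`, `gate_W`, and `top_p` are illustrative names, and the routing rule is assumed to be top-p gating over specialized experts with an always-on shared expert, not necessarily the paper's exact mechanism.

```python
import numpy as np

# Assumed forward pass: the generalized expert (leading-k SVD factors) is
# always applied; a linear router picks top_p specialized experts per token.
# Each expert is stored as its SVD factors (U, S, Vt) with W_e ≈ U diag(S) Vt,
# so the linear map y = x W_e^T becomes x @ (Vt.T * S) @ U.T.
def moe_forward(x, generalized, specialized, gate_W, top_p=1):
    gU, gS, gVt = generalized
    out = x @ (gVt.T * gS) @ gU.T            # shared expert, always active
    logits = x @ gate_W                       # router: one logit per expert
    top = np.argsort(logits, axis=-1)[:, -top_p:]
    for i, (u, s, vt) in enumerate(specialized):
        mask = (top == i).any(axis=-1, keepdims=True)
        out += mask * (x @ (vt.T * s) @ u.T)  # routed specialized expert
    return out

# Toy demo with random factors (checks shapes only).
rng = np.random.default_rng(0)
U, S, Vt = np.linalg.svd(rng.standard_normal((5, 7)), full_matrices=False)
gen = (U[:, :2], S[:2], Vt[:2, :])
spec = [(U[:, 2:4], S[2:4], Vt[2:4, :]), (U[:, 4:5], S[4:5], Vt[4:5, :])]
y = moe_forward(rng.standard_normal((3, 7)), gen, spec, rng.standard_normal((7, 2)))
assert y.shape == (3, 5)
```

A block like this would pin down both the role of k and where the routing decision enters the computation.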
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the empirical support for the SVD initialization and to improve statistical reporting. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§3.2] §3.2 (SVD Initialization): The central performance claims rest on the assumption that leading singular components preferentially encode pre-trained visual-semantic priors while residuals capture robot-control adaptations. The manuscript provides no analysis of the singular-value spectrum across VLA layers, no comparison of the spectral split against random or non-spectral expert initialization, and no ablation isolating the contribution of the SVD step versus the MoE routing itself; without this, gains may be attributable to routing or expert count rather than the claimed decomposition.
Authors: We agree that additional analysis would strengthen the attribution of gains to the spectral decomposition. In the revised manuscript we will add: (1) plots and quantitative analysis of singular-value spectra across VLA layers to illustrate the concentration of pre-trained energy in leading components, (2) an ablation that replaces SVD initialization with random initialization while keeping the MoE routing structure, expert count, and parameter budget identical, and (3) results that isolate the SVD step from the routing mechanism. These additions will directly address whether performance improvements derive from the claimed decomposition rather than routing alone.
revision: yes
-
Referee: [§4.2] §4.2 and Table 2 (LIBERO-Plus Results): The reported 81.2% average zero-shot success and consistent outperformance over FFT and PEFT baselines are presented without standard deviations across runs, statistical significance tests, or explicit confirmation that all baselines were re-implemented under an identical 2.51% trainable-parameter constraint; this weakens attribution of the improvement specifically to the generalized/specialized expert split.
Authors: We acknowledge the need for greater statistical rigor. In the revised manuscript we will: (1) report standard deviations computed over multiple independent runs for all LIBERO-Plus results, (2) include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) against the baselines, and (3) explicitly state in the methods section and Table 2 caption that every baseline was re-implemented under the identical 2.51% trainable-parameter budget. These changes will strengthen the attribution of improvements to the generalized/specialized expert design.
revision: yes
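The proposed tests can be run in a few lines of SciPy; the per-suite success values below are placeholders for illustration, not the paper's numbers:

```python
from scipy import stats

# Hypothetical per-suite success rates for two paired methods (placeholders).
vla_gse = [0.84, 0.79, 0.81, 0.83, 0.78]
lora    = [0.80, 0.74, 0.77, 0.79, 0.73]

# Paired t-test: assumes roughly normal per-suite differences.
t_stat, t_p = stats.ttest_rel(vla_gse, lora)

# Wilcoxon signed-rank: nonparametric alternative for small samples.
w_stat, w_p = stats.wilcoxon(vla_gse, lora)

print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```

With only a handful of task suites, the Wilcoxon exact p-value is bounded below (here it cannot drop under 1/16 two-sided for n=5), which is itself an argument for reporting multiple seeds per suite.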
Circularity Check
No significant circularity detected
full rationale
The paper presents VLA-GSE as an empirical PEFT framework whose core step is an SVD-based spectral decomposition of frozen weights to initialize generalized (leading singular) and specialized (residual) experts. This is a design choice, not a derivation. Reported results (81.2% LIBERO-Plus success, 2.51% trainable parameters, outperformance of baselines) are measured outcomes on external benchmarks rather than quantities that reduce to the initialization by construction. No equations, self-citations, or fitted inputs are shown that would make any performance claim tautological. The method remains self-contained against external evaluation.