Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

Chang Xu; Fengnian Zhang; Siyu Xu; Tao Huang; Zhong Jin

arxiv: 2606.31382 · v1 · pith:QULBAS52new · submitted 2026-06-30 · 💻 cs.RO

Fengnian Zhang, Tao Huang, Siyu Xu, Zhong Jin, Chang Xu

Pith reviewed 2026-07-01 05:05 UTC · model grok-4.3

classification 💻 cs.RO

keywords parameter pruning · vision-language-action models · VLM-to-VLA adaptation · model compression · robotic policies · LIBERO benchmark · parameter redundancy

The pith

Pruning VLA models by adaptation divergence removes 12-30% of parameters while retaining roughly 90% of task performance, with no recovery step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates parameter changes that occur when pre-trained vision-language models are adapted into vision-language-action models for controlling robots. It measures how parameters diverge across modules during this adaptation and treats direct performance impact from removal, without any recovery steps, as evidence of which parameters matter functionally. From the resulting modular patterns, it constructs a joint pruning method that targets subsets differently by module. On the LIBERO benchmark this yields 12-30% fewer parameters in OpenVLA and π0.5 while keeping roughly 90% of original task success, whereas magnitude-based or other standard criteria cause complete failure under the same no-recovery rule. The work therefore questions whether observed pruning fragility reflects true redundancy or simply the removal of adaptation-critical weights.

Core claim

Quantifying spatial patterns of parameter divergence during VLM-to-VLA adaptation reveals structured modular heterogeneities; controlled pruning without fine-tuning then serves as a diagnostic that links these divergence signals to functional importance, enabling a multi-module joint pruning scheme that reduces parameters of OpenVLA and π0.5 by 12-30% while preserving approximately 90% of original LIBERO performance.

What carries the argument

Multi-module joint pruning scheme that selects parameter subsets according to their measured divergence signals across modules during adaptation.
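
A minimal sketch of what such a selection could look like, under stated assumptions. The simulated rebuttal mentions an L2 norm on parameter deltas, so the score below is a per-channel relative L2 norm of the weight delta; the exact normalization, the row-per-channel layout, the per-module ratios, and whether a module drops its highest- or lowest-divergence units are all assumptions here, and every function name is hypothetical rather than taken from the paper.

```python
import numpy as np

def relative_divergence(w_vlm, w_vla, eps=1e-12):
    """Per-channel relative divergence ||W_vla - W_vlm|| / ||W_vlm||.

    Assumption: each row is one prunable unit (an FFN channel or a
    flattened attention head).
    """
    delta = np.linalg.norm(w_vla - w_vlm, axis=1)
    base = np.linalg.norm(w_vlm, axis=1)
    return delta / (base + eps)

def select_channels(scores, ratio, prune_highest):
    """Return sorted indices of the units to remove for one module."""
    k = int(round(ratio * scores.size))
    order = np.argsort(scores, kind="stable")  # ascending divergence
    chosen = order[-k:] if (prune_highest and k > 0) else order[:k]
    return np.sort(chosen)

def joint_prune_plan(modules, policy):
    """Build a per-module removal plan.

    modules: name -> (w_vlm, w_vla)
    policy:  name -> (ratio, prune_highest), chosen per module
    """
    plan = {}
    for name, (w_vlm, w_vla) in modules.items():
        ratio, prune_highest = policy[name]
        scores = relative_divergence(w_vlm, w_vla)
        plan[name] = select_channels(scores, ratio, prune_highest)
    return plan
```

Keeping the direction and ratio as per-module policy entries reflects the modular heterogeneity the summary describes: the sketch does not assume that high divergence means "important" in every module.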

If this is right

VLA performance degradation after pruning is not an inevitable consequence of parameter reduction if subsets are chosen according to adaptation divergence.
Standard pruning criteria that ignore adaptation history discard functionally critical weights under recovery-free evaluation.
Efficient robotic policies can be obtained directly from adapted models without subsequent fine-tuning or low-rank corrections.
Parameter redundancy in VLAs is structured by module rather than uniformly distributed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same divergence-probe method could be applied to diagnose redundancy in other adaptation settings such as language-to-vision or multi-task fine-tuning.
Tracking divergence during training might allow early stopping or regularization choices that reduce redundancy before full adaptation completes.
If the modular patterns prove stable across different base VLMs, the pruning ratios could be predicted from adaptation statistics alone without running the full diagnostic.

Load-bearing premise

That the immediate performance change after removing a parameter subset without any fine-tuning directly reveals whether those parameters carry functional importance induced by adaptation.

What would settle it

Running the same controlled pruning experiment on a new VLA model or task where the proposed divergence-based subsets produce performance collapse comparable to random or magnitude-based removal.
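
The falsification test described here could be organized as a small harness that applies each criterion at the same pruning ratio and evaluates the pruned model with no recovery step. This is a sketch only: `evaluate` is a hypothetical callable standing in for a full benchmark rollout, and the convention that the lowest-scoring units are removed first is an assumption made for illustration.

```python
import numpy as np

def probe(criteria, evaluate, ratio, n_units, seed=0):
    """Compare pruning criteria under one recovery-free evaluation.

    criteria: name -> score array of length n_units (lowest pruned first)
    evaluate: callable taking the removed indices, returning a success rate
    """
    k = int(round(ratio * n_units))
    rng = np.random.default_rng(seed)
    results = {}
    for name, scores in criteria.items():
        removed = np.argsort(np.asarray(scores), kind="stable")[:k]
        results[name] = evaluate(removed)
    # Random removal at the same ratio serves as the reference point.
    results["random"] = evaluate(rng.choice(n_units, size=k, replace=False))
    return results
```

If the divergence-based entry collapsed to the level of the random or magnitude entries on a new model or task, that would count against the claim.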

Figures

Figures reproduced from arXiv: 2606.31382 by Chang Xu, Fengnian Zhang, Siyu Xu, Tao Huang, Zhong Jin.

**Figure 1.** Visualizing the relative parameter divergence ∆W_rel between the Prismatic (VLM) and OpenVLA (VLA) model pair. The color intensity indicates the magnitude of divergence: darker blue denotes significant parameter shifts, while brighter yellow represents minimal change. Subfigures (a)–(g) display the divergence across different modules, calculated at the granularity of individual attention heads or FFN channe…

**Figure 2.** Visualizing the relative parameter divergence ∆W_rel for the PaLI-Gemma (VLM) and π0.5 (VLA) model pair. The color intensity indicates the magnitude of divergence: darker blue denotes significant parameter shifts, while brighter yellow represents minimal change. Subfigures (a)–(d) display the divergence across different modules, calculated at the granularity of individual attention heads or FFN channels. Th…

**Figure 3.** Causal analysis of recovery difficulty: convergence steps vs. pruning ratio.

… (∼95.0% SR), while removing lowest-diff channels causes significant drops. Language model attention also collapses (0.0%) only when the lowest-diff heads are removed. These cross-model observations confirm Hypothesis II: VLM-to-VLA parameter divergence contains structured signals that effectively distinguish vital from redundant…

Original abstract

Vision-Language-Action (VLA) models have made significant strides in embodied intelligence by integrating the powerful representations of pre-trained Vision-Language Models (VLMs). However, the massive parameter scale of VLAs imposes a heavy computational burden, and these models exhibit extreme sensitivity to parameter pruning. Current paradigms often treat the resulting performance degradation as inevitable, relying on fine-tuning or low-rank corrections to recover efficacy. We challenge this convention by questioning whether the removed parameters are truly redundant if VLA pruning necessitates performance recovery to be effective, or if this paradigm masks the indiscriminate pruning of critical parameters. We revisit parameter redundancy through the lens of VLM-to-VLA adaptation, first quantifying the spatial distribution of parameter divergence during adaptation to reveal structured patterns across different modules. Subsequently, we introduce controlled pruning as a diagnostic probe: by comparing the direct impact of removing different parameter subsets on VLA performance without any fine-tuning, we establish a causal link between adaptation-induced divergence signals and functional contributions. Based on the discovered modular heterogeneities, we design a multi-module joint pruning scheme. Evaluations on the LIBERO benchmark demonstrate that our approach reduces the parameters of OpenVLA and π0.5 by 12%–30% while maintaining approximately 90% of the original performance without any post-pruning recovery. In contrast, existing parameter pruning criteria result in total performance collapse when evaluated under the same recovery-free constraints. Our study reveals the parameter evolution mechanism in VLA adaptation and provides a new path for deploying efficient, robust robotic policies in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows adaptation-divergence pruning can cut 12-30% of VLA parameters with ~90% LIBERO retention and no recovery, unlike baselines, but the causal link to the divergence metric itself may be confounded by module position or scale.

The main point is that this work finds a pruning scheme guided by how parameters diverge during VLM-to-VLA adaptation, which removes 12-30% of parameters from OpenVLA and π0.5 while holding onto about 90% performance on LIBERO without any fine-tuning afterward. Standard pruning criteria collapse under the same no-recovery test.

What is new is the use of adaptation divergence patterns to spot modular differences and then build a joint pruning method around them. Treating direct no-recovery pruning as a diagnostic probe to connect those signals to actual task contribution is a straightforward way to test importance, and it gives a clean head-to-head on the benchmark.

The experiments are useful because they focus on a practical constraint for robotics deployment and use a public dataset. The framing also usefully questions why recovery is treated as necessary in the first place.

The soft spot is the causality step. Divergence often tracks with layer depth or weight magnitude, both of which can affect performance on their own. Without ablations that match or control for those factors, the performance edge could come from pruning certain modules rather than from following the divergence signal specifically. The abstract describes the quantification and the probe, but if the full paper does not add those extra checks, the interpretation stays suggestive.
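
One way to probe this confound, sketched under assumptions: regress the divergence score on layer depth and weight magnitude, then rank units by the residual, so that any ranking that survives is not linearly explained by those two covariates. The function name and the choice of an ordinary least-squares fit are illustrative; neither the paper nor this letter specifies a control of this form.

```python
import numpy as np

def residualize(divergence, depth, magnitude):
    """Remove the linear effect of depth and magnitude from divergence.

    All inputs are 1-D arrays with one entry per prunable unit.
    Returns the least-squares residuals of divergence ~ 1 + depth + magnitude.
    """
    y = np.asarray(divergence, dtype=float)
    X = np.column_stack([
        np.ones_like(y),                      # intercept
        np.asarray(depth, dtype=float),       # layer index of each unit
        np.asarray(magnitude, dtype=float),   # e.g. weight norm of each unit
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef
```

Pruning by residual rank and repeating the no-recovery test would show whether the advantage persists once depth and magnitude are accounted for. A matched design, comparing units of similar depth and magnitude but different divergence, would be an alternative that avoids the linearity assumption.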

This is for researchers working on efficient VLAs and model compression for embodied systems. It offers a concrete efficiency angle worth testing.

Send it to peer review. The no-recovery result is worth referee scrutiny even if the mechanism needs tighter isolation.

Referee Report

2 major / 2 minor

Summary. The paper claims that quantifying parameter divergence during VLM-to-VLA adaptation reveals modular heterogeneities; a controlled no-recovery pruning probe then establishes that divergence signals mark functionally important parameters; and a resulting multi-module joint pruning scheme reduces parameters in OpenVLA and π0.5 by 12–30% while retaining ~90% of original LIBERO performance, whereas standard criteria cause total collapse under identical recovery-free conditions.

Significance. If the central empirical result holds after addressing the causality concern, the work supplies a concrete, recovery-free route to smaller VLAs and new evidence on how adaptation redistributes parameter importance. The use of a public benchmark (LIBERO) and the explicit no-fine-tuning diagnostic are strengths that support reproducibility and falsifiability.

major comments (2)

[Abstract and controlled pruning experiment] Abstract (controlled pruning paragraph) and the corresponding experimental section: the claim that the performance gap demonstrates a causal link between adaptation-induced divergence and functional contribution is load-bearing yet rests on an unablated comparison. Divergence is known to correlate with layer depth and weight magnitude in VLM-to-VLA transfer; without explicit matching or regression controls on these covariates, the observed superiority over baselines could be explained by module choice rather than the divergence metric itself.
[Experimental results on LIBERO] Experimental results section (LIBERO tables/figures): the reported 12–30% reduction and ~90% retention figures lack accompanying details on exact divergence quantification (e.g., which norm or distance), per-module pruning thresholds, baseline implementations, number of seeds, and statistical tests. These omissions prevent verification that the performance retention is robust and not an artifact of particular hyper-parameter choices.

minor comments (2)

[Throughout] Notation: the symbol π0.5 appears inconsistently (sometimes with subscript, sometimes without); standardize throughout.
[Figures] Figure clarity: ensure that divergence heat-maps or module-wise plots include axis labels, color-bar scales, and error bars where applicable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the causal interpretation and improving experimental transparency. We respond to each major comment below.

Referee: [Abstract and controlled pruning experiment] Abstract (controlled pruning paragraph) and the corresponding experimental section: the claim that the performance gap demonstrates a causal link between adaptation-induced divergence and functional contribution is load-bearing yet rests on an unablated comparison. Divergence is known to correlate with layer depth and weight magnitude in VLM-to-VLA transfer; without explicit matching or regression controls on these covariates, the observed superiority over baselines could be explained by module choice rather than the divergence metric itself.

Authors: We thank the referee for pointing out this potential confound. While our experiments compare the performance impact of pruning based on divergence versus standard criteria like magnitude, we did not explicitly control for correlations with layer depth and weight magnitude through matching or regression. We agree this is a valid concern and will incorporate additional controls and ablations in the revised manuscript to better isolate the effect of the divergence metric. revision: yes
Referee: [Experimental results on LIBERO] Experimental results section (LIBERO tables/figures): the reported 12–30% reduction and ~90% retention figures lack accompanying details on exact divergence quantification (e.g., which norm or distance), per-module pruning thresholds, baseline implementations, number of seeds, and statistical tests. These omissions prevent verification that the performance retention is robust and not an artifact of particular hyper-parameter choices.

Authors: We agree with the need for more details to ensure reproducibility. The revised manuscript will include: the specific divergence measure (L2 norm on parameter deltas), the exact per-module pruning thresholds, full descriptions of baseline methods, the number of evaluation seeds, and appropriate statistical tests for the reported results. revision: yes
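
As an illustration of the kind of statistical reporting promised here, the sketch below computes a Wilson score interval for a task success rate and the retention ratio relative to the unpruned model. It is a generic example, not the authors' evaluation code: the function names are hypothetical, and pooling episodes across seeds into a single binomial count is an assumption.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

def retention(pruned_sr, dense_sr):
    """Fraction of the unpruned model's success rate kept after pruning."""
    if dense_sr <= 0:
        raise ValueError("dense_sr must be positive")
    return pruned_sr / dense_sr
```

Reporting the interval alongside the retention figure makes it possible to judge whether a claim such as "about 90% retained" is distinguishable from noise at the number of episodes actually run.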

Circularity Check

0 steps flagged

Empirical ablation on public benchmark exhibits no circularity

The paper conducts an empirical study: it quantifies parameter divergence across modules during VLM-to-VLA adaptation, then applies controlled pruning (no recovery fine-tuning) on the LIBERO benchmark to measure performance retention. Reported results (12-30% parameter reduction at ~90% performance for OpenVLA and π0.5) are direct benchmark outcomes, not quantities defined by or fitted to the divergence metric itself. No equations, self-citations, or ansatzes reduce the performance claims to inputs by construction; the work remains self-contained against an external public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on empirical observations of parameter divergence and recovery-free pruning tests.


[1] [1]

Fluctuation-based adaptive structured pruning for large language models

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865–10873, 2024

2024

[2] [2]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang,DanielSalz,MaximNeumann,IbrahimAlabdulmohsin,MichaelTschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Yuxuan Chen and Xiao Li. Rlrc: Reinforcement learning-based recovery for com- pressed vision-language-action models.arXiv preprint arXiv:2506.17639, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowd- hery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision- language-action models.arXiv preprint arXiv:2509.09090, 2025

Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, and Huanrui Yang. Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision- language-action models.arXiv preprint arXiv:2509.09090, 2025

work page arXiv 2025

[7] [7]

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Weifan Guan, Qinghao Hu, Aosheng Li, and Jian Cheng. Efficient vision-language- action models for embodied manipulation: A systematic survey.arXiv preprint arXiv:2510.17111, 2025

2025

[8] [8]

Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, and Anirudha Majumdar. Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

2025

[9] [9]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Per...

2025

[10] [10]

Don’t run with scissors: Pruning breaks vla models but they can be recovered. arXiv preprint arXiv:2510.08464, 2025

Jason Jabbour, Dong-Ki Kim, Max Smith, Jay Patrikar, Radhika Ghosal, Youhui Wang, Ali Agha, Vijay Janapa Reddi, and Shayegan Omidshafiei. Don’t run with scissors: Pruning breaks vla models but they can be recovered. arXiv preprint arXiv:2510.08464, 2025

2025

[11] [11]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning, 2024

2024

[12] [12]

Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025

2025

[13] [13]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

2024

[14] [14]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

2024

[15] [15]

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

Wenhao Li, Xiu Su, Yichao Cao, Hongyan Xu, Xiaobo Xia, Shan You, Yi Chen, and Chang Xu. Vla-attc: Adaptive test-time compute for vla models with relative action critic model. arXiv preprint arXiv:2605.01194, 2026

2026

[16] [16]

Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723, 2025

Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, and Wenwu Zhu. Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723, 2025

2025

[17] [17]

Libero: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

2023

[18] [18]

Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024

Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems, 37:40085–40110, 2024

2024

[19] [19]

Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702–21720, 2023

2023

[20] [20]

A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2405.14093, 2024

2024

[21] [21]

Mechanistic finetuning of vision-language-action models via few-shot demonstrations

Chancharik Mitra, Yusen Luo, Raj Saravanan, Dantong Niu, Anirudh Pai, Jesse Thomason, Trevor Darrell, Abrar Anwar, Deva Ramanan, and Roei Herzig. Mechanistic finetuning of vision-language-action models via few-shot demonstrations. arXiv preprint arXiv:2511.22697, 2025

2025

[22] [22]

Saliency-aware quantized imitation learning for efficient robotic control

Seongmin Park, Hyungmin Kim, Sangwoo Kim, Wonseok Jeon, Juyoung Yang, Byeongwook Jeon, Yoonseon Oh, and Jungwook Choi. Saliency-aware quantized imitation learning for efficient robotic control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13140–13150, 2025

2025

[23] [23]

Action-aware dynamic pruning for efficient vision-language-action manipulation

Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, and Chang Xu. Action-aware dynamic pruning for efficient vision-language-action manipulation. arXiv preprint arXiv:2509.22093, 2025

2025

[24] [24]

Vision-language-action (vla) models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025

Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action (vla) models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025

2025

[25] [25]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

2025

[26] [26]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

2023

[27] [27]

Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200, 2025

Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, and Tao Chen. Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200, 2025

2025

[28] [28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

2023

[29] [29]

Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530, 2025

Hongyu Wang, Chuyan Xiong, Ruiping Wang, and Xilin Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530, 2025

2025

[30] [30]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

2025

[31] [31]

Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175, 2025

Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175, 2025

2025

[32] [32]

Affordance field intervention: Enabling vlas to escape memory traps in robotic manipulation. arXiv preprint arXiv:2512.07472, 2025

Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, and Chang Xu. Affordance field intervention: Enabling vlas to escape memory traps in robotic manipulation. arXiv preprint arXiv:2512.07472, 2025

2025

[33] [33]

Qvla: Not all channels are equal in vision-language-action model’s quantization. arXiv preprint arXiv:2602.03782, 2026

Yuhao Xu, Yantai Yang, Zhenyang Fan, Yufan Liu, Yuming Li, Bing Li, and Zhipeng Zhang. Qvla: Not all channels are equal in vision-language-action model’s quantization. arXiv preprint arXiv:2602.03782, 2026

2026

[34] [34]

Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100, 2025

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100, 2025

2025

[35] [35]

A survey on efficient vision-language-action models, 2025

Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. A survey on efficient vision-language-action models. arXiv preprint arXiv:2510.24795, 2025

2025

[36] [36]

Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems, 37:56619–56643, 2024

Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems, 37:56619–56643, 2024

2024

[37] [37]

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

2026

[38] [38]

Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025

2025

[39] [39]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

A Detailed Model Composition and Parameter ...

2023