pith. machine review for the scientific record.

arxiv: 2511.18082 · v3 · submitted 2025-11-22 · 💻 cs.CV · cs.RO


ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models


Pith reviewed 2026-05-17 06:03 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords Vision-Language-Action models · model distillation · efficient robotics · action prediction · dynamic routing · graph encapsulation · embodied AI · knowledge transfer

The pith

ActDistill transfers action prediction from full-scale VLA models to lightweight students via graph-encapsulated hierarchies and dynamic routing, cutting computation by more than 50 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ActDistill as a distillation method that moves action prediction skill from any trained vision-language-action model into a smaller student. It wraps the teacher's action process in a graph that shows how predictions build step by step across layers. The student then uses a dynamic router to pick only the computation paths needed for each action, trained with supervision that follows the same graph structure. Once trained, the graph components are removed, so the student runs a slim set of layers at inference time. This action-first approach yields students that match or exceed the teacher's performance on robotic tasks while cutting computation and latency.
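Read as machinery, the routed student is easy to sketch. Below is a minimal, hedged sketch assuming a per-layer sigmoid gate over a pooled hidden-state summary; the layer shapes, router conditioning, and 0.5 threshold are illustrative assumptions, not ActDistill's published design, and training such a hard gate would additionally need a soft relaxation or a straight-through estimator.

```python
import torch
import torch.nn as nn

class RoutedStudent(nn.Module):
    """Minimal sketch: a stack of layers behind a per-sample hard gate."""

    def __init__(self, layers: nn.ModuleList, hidden_dim: int):
        super().__init__()
        self.layers = layers
        # One gate logit per layer, conditioned on a pooled summary of the
        # hidden state; the paper's router conditions on action-prediction
        # demands, which this summary only approximates.
        self.router = nn.Linear(hidden_dim, len(layers))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, tokens, hidden_dim)
        keep = torch.sigmoid(self.router(h.mean(dim=1))) > 0.5  # (batch, n_layers)
        for i, layer in enumerate(self.layers):
            mask = keep[:, i].float().view(-1, 1, 1)
            # Batched sketch: compute-and-mask. A deployed router would skip
            # gated-off layers outright, which is where the savings live.
            h = mask * layer(h) + (1.0 - mask) * h
        return h

# Toy usage: 12 MLP blocks over a small hidden state.
layers = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(12))
student = RoutedStudent(layers, hidden_dim=64)
out = student(torch.randn(2, 8, 64))  # (2, 8, 64)
```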

Core claim

ActDistill is a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. It employs a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction, equips the student model with a dynamic router that adaptively selects computation paths based on action prediction demands, and applies hierarchical graph-informed supervision to ensure smooth evolution. At inference, the graph-related auxiliary components can be removed, so the student executes only dynamically routed layers and predicts high-precision actions with minimal computation and latency.

What carries the argument

Graph-structured encapsulation that models the hierarchical evolution of action prediction, paired with a dynamic router and hierarchical supervision that together allow auxiliary components to be dropped at inference while preserving action precision.
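The paper's exact supervision is not reproduced here; one plausible reading of hierarchical graph-informed supervision is a per-level distillation loss that aligns pooled student readouts with teacher action embeddings at matching graph levels. The pooling, the MSE objective, and the per-level weights below are assumptions, not the published loss.

```python
import torch
import torch.nn.functional as F

def hierarchical_distill_loss(student_states, teacher_nodes, readout_heads, level_weights):
    """student_states: hidden states at supervised depths, each (B, T, D).
    teacher_nodes:  teacher action embeddings at matching graph levels, each (B, A).
    readout_heads:  linear heads projecting pooled student states to (B, A).
    level_weights:  scalar loss weight per supervised level."""
    loss = torch.tensor(0.0)
    for h_s, a_t, head, w in zip(student_states, teacher_nodes, readout_heads, level_weights):
        a_s = head(h_s.mean(dim=1))  # pooled student action readout at this level
        loss = loss + w * F.mse_loss(a_s, a_t)
    return loss
```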

If this is right

  • Lightweight VLA models reach comparable or superior performance to full-scale models on embodied benchmarks.
  • Computation drops by more than 50 percent and inference runs up to 1.67 times faster (see the arithmetic sketch after this list).
  • The framework works with any existing VLA model as a general method for action-oriented compression.
  • Action priors guide the transfer instead of vision-language correlations alone, producing efficiency gains specific to robotic manipulation.
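The compute and speedup figures in the second bullet are mutually consistent under a simple fixed-overhead model: if only part of the runtime sits in routable backbone layers, halving backbone compute yields less than a 2x wall-clock speedup. A back-of-envelope check, where the 80 percent routable share is an assumption rather than the paper's accounting:

```python
routable_fraction = 0.80  # assumed share of runtime spent in routable backbone layers
compute_kept = 0.50       # student executes ~half the backbone compute

new_time = (1 - routable_fraction) + routable_fraction * compute_kept
print(f"speedup ~ {1 / new_time:.2f}x")  # ~1.67x under these assumptions
```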

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph encapsulation of sequential decisions could be tested on non-robotics tasks such as step-by-step planning in language models.
  • The dynamic router might let a single student model handle varying task difficulties across different robot hardware without separate training runs.
  • If the router learns to cover all cases reliably, future versions could prune even more layers while keeping the same accuracy.
  • Applying the same action-guided distillation across multiple robots or environments would show whether the hierarchy generalizes beyond the original training benchmarks.

Load-bearing premise

The graph-structured encapsulation accurately models the hierarchical evolution of action prediction and the dynamic router plus hierarchical supervision allow removal of auxiliary components at inference without degrading action precision.

What would settle it

Train a lightweight student on the same VLA teacher without any graph encapsulation or dynamic router, then measure whether it still reaches comparable performance with over 50 percent computation reduction on the same embodied benchmarks. If it does, the action-guided graph machinery is not necessary.
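A minimal protocol sketch for that test, with train_student and benchmark as hypothetical placeholders standing in for the paper's training and evaluation pipelines:

```python
def necessity_ablation(teacher, benchmarks, train_student, benchmark):
    """Compare a fully equipped student against one trained with the
    graph machinery stripped, at a matched compute budget."""
    full = train_student(teacher, graph_encapsulation=True, dynamic_router=True)
    stripped = train_student(teacher, graph_encapsulation=False, dynamic_router=False)
    # If `stripped` matches `full` at the same >50% compute reduction,
    # the action-guided graph machinery is not load-bearing.
    return {name: {"full": benchmark(full, env), "stripped": benchmark(stripped, env)}
            for name, env in benchmarks.items()}
```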

Figures

Figures reproduced from arXiv: 2511.18082 by Fengling Li, Guoli Yang, Hengtao Shen, Lei Zhu, Tianshi Wang, Wencheng Ye.

Figure 1. Comparison between previous efficiency VLA strategies …
Figure 2. Overview of the ActDistill framework. Given visual input v (e.g., RGB frames or multi-view perception) and a language instruction l, the model outputs an action vector a encoding control parameters such as end-effector pose and gripper state. A typical VLA architecture includes a visual encoder Ev, a language encoder El, a multimodal backbone B, and an action head H. For an input pair (v, l), …
Figure 3. Performance-efficiency trade-off across different layer …
Figure 4. Visualization of layer-wise activation frequency across …
Figure 6. Final-layer attention heatmaps comparing the original …
Figure 7. Additional heatmap visualizations comparing ActDistill and the teacher model.
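Figure 2's caption fixes the interface the rest of the review refers to: encoders Ev and El, a backbone B, and an action head H that together map (v, l) to an action vector a. A minimal sketch of that generic pipeline follows; the concatenation-based token fusion is an assumption, since real VLA architectures fuse modalities in different ways.

```python
import torch
import torch.nn as nn

class GenericVLA(nn.Module):
    """Generic VLA interface from Figure 2: a = H(B([Ev(v); El(l)]))."""

    def __init__(self, Ev: nn.Module, El: nn.Module, B: nn.Module, H: nn.Module):
        super().__init__()
        self.Ev, self.El, self.B, self.H = Ev, El, B, H

    def forward(self, v: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([self.Ev(v), self.El(l)], dim=1)  # fuse token streams
        return self.H(self.B(tokens))  # action vector a: pose and gripper state
```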
original abstract

Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ActDistill, an action-guided self-derived distillation framework for compressing Vision-Language-Action (VLA) models. It uses a well-trained VLA as teacher, introduces graph-structured encapsulation to model the hierarchical evolution of action prediction, equips the student with a dynamic router for adaptive path selection under hierarchical graph-informed supervision, and removes all graph-related auxiliaries at inference to achieve efficient action prediction. Experiments on embodied benchmarks are reported to show comparable or superior performance to full-scale VLA models with over 50% computation reduction and up to 1.67x speedup.

Significance. If the central efficiency claims are substantiated, the work could meaningfully advance practical deployment of VLA models in resource-limited robotic settings by offering an action-oriented compression paradigm that goes beyond vision-language focused methods.

major comments (2)
  1. [Abstract and Experiments] Benchmark gains are stated (comparable/superior performance, >50% compute reduction, 1.67x speedup) but no quantitative baselines, standard-deviation or variance measures, or ablation results on the graph encapsulation and dynamic router are supplied. This leaves the performance claim only partially supported and makes it difficult to isolate the contribution of the new structural elements.
  2. [Method] Graph-structured encapsulation and dynamic router: the efficiency claim rests on the assumption that hierarchical supervision during training allows the dynamic router (conditioned only on remaining layers) to fully encode the action-prediction hierarchy, so that graph auxiliaries can be dropped at inference without precision loss. No ablation or analysis tests this transfer; if the router was trained to rely on auxiliary graph features, action precision may degrade once they are removed.
minor comments (2)
  1. [Method] Clarify in the method section how the graph-structured encapsulation is formally defined and how it differs from existing hierarchical or tree-based action models in the VLA literature.
  2. [Introduction] Add a short related-work paragraph contrasting ActDistill with prior distillation and routing techniques for VLMs or VLAs to better position the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and outlining revisions to strengthen the empirical support for our claims.

point-by-point responses
  1. Referee: [Abstract and Experiments] Benchmark gains are stated (comparable/superior performance, >50% compute reduction, 1.67x speedup) but no quantitative baselines, standard-deviation or variance measures, or ablation results on the graph encapsulation and dynamic router are supplied. This leaves the performance claim only partially supported and makes it difficult to isolate the contribution of the new structural elements.

    Authors: We agree that including quantitative details, variance measures, and ablations would enhance the clarity and rigor of our experimental claims. The full paper presents comparisons on embodied benchmarks, but to directly address this, we will revise the abstract and experiments section to include specific baseline metrics with standard deviations from multiple runs and dedicated ablation studies on the graph encapsulation and dynamic router components. revision: yes

  2. Referee: [Method] Graph-structured encapsulation and dynamic router: the efficiency claim rests on the assumption that hierarchical supervision during training allows the dynamic router (conditioned only on remaining layers) to fully encode the action-prediction hierarchy, so that graph auxiliaries can be dropped at inference without precision loss. No ablation or analysis tests this transfer; if the router was trained to rely on auxiliary graph features, action precision may degrade once they are removed.

    Authors: The hierarchical graph-informed supervision is designed to guide the student model during training such that the dynamic router learns to select paths based on action demands without needing the auxiliaries at inference. We acknowledge the value of explicit verification for this transfer. In the revised manuscript, we will add an analysis or ablation study comparing the student's performance with and without the graph-related components at inference time to substantiate that no precision loss occurs. revision: yes

Circularity Check

0 steps flagged

No circularity: new distillation architecture with empirical validation

full rationale

The paper introduces ActDistill as an extension of teacher-student distillation, adding graph-structured encapsulation to model action prediction hierarchies, a dynamic router for path selection, and hierarchical supervision. These are presented as novel design choices rather than derived from prior equations or self-citations within the work. The central efficiency claim—that auxiliaries can be removed at inference while preserving precision—is supported by benchmark experiments showing >50% computation reduction and 1.67x speedup, not by any reduction of reported metrics to quantities fitted inside the same paper. No load-bearing step equates a prediction to its own input by construction, and the framework remains self-contained against external VLA benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the availability of a capable pre-trained VLA teacher and on the premise that action-oriented graph supervision transfers without loss after auxiliary components are stripped at inference.

axioms (1)
  • domain assumption A well-trained VLA model exists that can serve as an effective teacher whose action predictions contain transferable hierarchical structure.
    Invoked when the paper states a well-trained VLA model is employed as teacher.
invented entities (2)
  • Graph-structured encapsulation no independent evidence
    purpose: Explicitly model the hierarchical evolution of action prediction
    New structural wrapper introduced to guide distillation.
  • Dynamic router no independent evidence
    purpose: Adaptively select computation paths based on action prediction demands
    New component that uses graph-informed supervision to decide active layers.

pith-pipeline@v0.9.0 · 5549 in / 1301 out tokens · 47502 ms · 2026-05-17T06:03:33.305269+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 23716–23736, 2022.

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

  3. [3]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.

  4. [4]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  5. [5]

    EdgeVLA: Efficient Vision-Language-Action Models

    Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, and Benjamin Bolte. EdgeVLA: Efficient vision-language-action models. arXiv preprint arXiv:2507.14049, 2025.

  6. [6]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.

  7. [7]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.

  8. [8]

    Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

    Weifan Guan, Qinghao Hu, Aosheng Li, and Jian Cheng. Efficient vision-language-action models for embodied manipulation: A systematic survey. arXiv preprint arXiv:2510.17111, 2025.

  9. [9]

    The Better You Learn, The Smarter You Prune: Towards Efficient Vision-Language-Action Models via Differentiable Token Pruning

    Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, and Xianpeng Lang. The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594, 2025.

  10. [10]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

  11. [11]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025.

  12. [12]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.

  13. [13]

    Vision-Language Foundation Models as Effective Robot Imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. In Proceedings of the International Conference on Learning Representations, pages 1–12, 2024.

  14. [14]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Oier Mees, Karl Pertsch, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. In Proceedings of the Conference on Robot Learning, pages 3705–3728, 2024.

  15. [15]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 44776–44791, 2023.

  16. [16]

    RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. RoboMamba: Efficient vision-language-action model for robotic reasoning and manipulation. In Proceedings of the Advances in Neural Information Processing Systems, pages 40085–40110, 2024.

  17. [17]

    VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025.

  18. [18]

    A Survey on Vision-Language-Action Models for Embodied AI

    Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.

  19. [19]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 6892–6903, 2024.

  20. [20]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.

  21. [21]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

  22. [22]

    CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

    Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, and Haoang Li. CEED-VLA: Consistency vision-language-action model with early-exit decoding. arXiv preprint arXiv:2506.13725, 2025.

  23. [23]

    Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models

    Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, and Tao Chen. Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200, 2025.

  24. [24]

    VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, and Donglin Wang. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.

  25. [25]

    TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. TinyVLA: Toward fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 10(4):3988–3995, 2025.

  26. [26]

    VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation

    Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. VLA-Cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv preprint arXiv:2502.02175, 2025.

  27. [27]

    EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

    Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. EfficientVLA: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100, 2025.

  28. [28]

    DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

    Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. DeeR-VLA: Dynamic inference of multimodal large language models for efficient robot execution. In Proceedings of the Advances in Neural Information Processing Systems, pages 56619–56643, 2024.

  29. [29]

    Pure Vision Language Action (VLA) Models: A Comprehensive Survey

    Dapeng Zhang, Jin Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025.

  30. [30]

    Vision-Language Models for Vision Tasks: A Survey

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625–5644, 2024.

  31. [31]

    MoLe-VLA: Dynamic Layer-Skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

    Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. MoLe-VLA: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025.

  32. [32]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. In Proceedings of the International Conference on Machine Learning, pages 1–18, 2025.