ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

Chenglei Wu; Huanan Liu; Jiajun Fan; Kangye Ji; Shiyu Qin; Shu-Tao Xia; Ye Li; Yuan Meng; Yuansong Wang; Zhi Wang

arxiv: 2605.29438 · v1 · pith:H7T6C77Pnew · submitted 2026-05-28 · 💻 cs.RO

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

Pith reviewed 2026-06-29 07:14 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action models · dynamic compute scheduling · robotic manipulation · inference acceleration · phase-adaptive inference · real-time robot control · plug-in framework · temporal reuse

The pith

ElegantVLA accelerates VLA models by scheduling full recomputation only during goal-sensitive phases of robot motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language-action models waste computation by running full perception, language, and action refinement at every control step. It proposes a plug-in scheduler that detects stable phases using representation similarity, motion cues, and episode progress, then drops to reuse modes for vision-language reasoning and action denoising. This keeps the base model unchanged while raising control frequency and cutting compute. A sympathetic reader would care because current VLA systems run too slowly for reliable real-time physical manipulation.

Core claim

ElegantVLA is a phase-adaptive inference framework that introduces a lightweight scheduler to coordinate five-level vision-LLM compute modes (full recomputation to multi-step temporal reuse) and three-level action denoising modes. The scheduler jointly observes temporal representation similarity, robot-motion cues, and episode progress to allocate computation only when needed, enabling acceleration of modern VLA pipelines that contain explicit action-generation modules without any modification or retraining of the base model.
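The joint decision space described above can be sketched as two small enumerations and their Cartesian product. This is a minimal illustration, not the authors' code: the intermediate level names, the `SchedulerObservation` fields, and the flat index encoding are assumptions. The source fixes only the counts (five Vision-LLM levels, three denoising levels) and the endpoints (full recomputation, multi-step temporal reuse, full refinement).

```python
from dataclasses import dataclass
from enum import IntEnum
from itertools import product
from typing import List, Tuple


class VisionLLMMode(IntEnum):
    """Five Vision-LLM compute levels. Only the two endpoints are named in the source."""
    FULL_RECOMPUTE = 0
    LEVEL_1 = 1           # hypothetical placeholder for an intermediate level
    LEVEL_2 = 2           # hypothetical placeholder
    LEVEL_3 = 3           # hypothetical placeholder
    MULTI_STEP_REUSE = 4


class DenoiseMode(IntEnum):
    """Three action-denoising levels. Only full refinement is named in the source."""
    FULL_REFINEMENT = 0
    LEVEL_1 = 1           # hypothetical placeholder
    LEVEL_2 = 2           # hypothetical placeholder


@dataclass(frozen=True)
class SchedulerObservation:
    """The three signals the scheduler is said to observe (field names assumed)."""
    representation_similarity: float
    motion_cue: float
    episode_progress: float


# Every (Vision-LLM mode, denoising mode) pair the scheduler could pick: 5 * 3 = 15.
JOINT_ACTIONS: List[Tuple[VisionLLMMode, DenoiseMode]] = list(
    product(VisionLLMMode, DenoiseMode)
)


def decode_joint_action(index: int) -> Tuple[VisionLLMMode, DenoiseMode]:
    """Map a flat action index in [0, 15) to a pair of compute modes."""
    if not 0 <= index < len(JOINT_ACTIONS):
        raise ValueError(f"joint action index out of range: {index}")
    return JOINT_ACTIONS[index]
```

Encoding the choice as one joint action rather than two independent ones reflects the claim that the backbone and action-head decisions are coordinated; how the scheduler actually scores these actions is not specified on this page.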

What carries the argument

The lightweight scheduler that selects vision-LLM and action-head compute modes from temporal representation similarity, robot-motion cues, and episode progress.

If this is right

Up to 2.55x speedup on GR00T and 3.77x on CogACT benchmarks.
On six real-world GR00T tasks, computation drops by 2.18x and control frequency rises from 13.8 Hz to 26.3 Hz.
The method works as a plug-in for any VLA with explicit action-generation modules and requires no base-model changes.
Coordination of vision-language and action decisions preserves performance while raising efficiency.
The approach applies to sequential embodied control where reasoning demand varies across steps.
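The real-world figures above can be turned into per-step latencies with simple arithmetic. The sketch below does that and, as a loudly labeled assumption, applies Amdahl's law to ask what share of the per-step time would have to be accelerated by 2.18x to produce the observed frequency gain. The page does not say whether "computation" means wall-clock time or FLOPs, so the second calculation is a back-of-the-envelope reading, not a result from the paper.

```python
def step_latency_ms(control_hz: float) -> float:
    """Average time per control step, in milliseconds, at a given control frequency."""
    return 1000.0 / control_hz


def accelerated_fraction(overall_speedup: float, component_speedup: float) -> float:
    """Amdahl's law solved for p: overall = 1 / ((1 - p) + p / component).

    ASSUMPTION: the reported compute reduction acts as a wall-clock speedup on a
    fraction p of each step, and the remaining (1 - p) is unaffected overhead.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / component_speedup)


# Numbers reported for the six real-world GR00T tasks.
BASE_HZ = 13.8
FAST_HZ = 26.3
COMPUTE_REDUCTION = 2.18

base_ms = step_latency_ms(BASE_HZ)    # about 72.5 ms per step
fast_ms = step_latency_ms(FAST_HZ)    # about 38.0 ms per step
frequency_gain = FAST_HZ / BASE_HZ    # about 1.91x
implied_p = accelerated_fraction(frequency_gain, COMPUTE_REDUCTION)

print(f"latency: {base_ms:.1f} ms -> {fast_ms:.1f} ms ({frequency_gain:.2f}x)")
print(f"implied accelerated share of step time: {implied_p:.1%}")
```

Under that assumption, the gap between the 2.18x compute reduction and the roughly 1.91x frequency gain would correspond to a small share of per-step time that the scheduler does not touch.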

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stability-based scheduling could be tested on other sequential multimodal models outside robotics.
If the scheduler generalizes, it might combine with quantization or distillation for further gains.
Long-horizon tasks with infrequent goal changes would likely see the largest frequency increases.
Failure modes would appear first on tasks where visual or motion similarity misleads the reuse decision.

Load-bearing premise

The scheduler can correctly decide when to reuse prior computation versus run full recomputation using only those three signals, without lowering task success rates.

What would settle it

Apply the scheduler to the same GR00T or CogACT evaluation tasks and measure whether success rate falls below the unmodified base model.
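One way to make that comparison concrete is to report each success rate with a confidence interval instead of a bare percentage. The sketch below uses the standard Wilson score interval; the trial counts in the usage lines are hypothetical placeholders, not data from the paper, and the paper's own evaluation protocol is not described on this page.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    return center - half, center + half


def success_rate_drop(base_successes: int, base_trials: int,
                      fast_successes: int, fast_trials: int) -> float:
    """Point estimate of how far the accelerated policy falls below the base policy."""
    return base_successes / base_trials - fast_successes / fast_trials


# HYPOTHETICAL counts, for illustration only.
base_ci = wilson_interval(45, 50)
fast_ci = wilson_interval(44, 50)
drop = success_rate_drop(45, 50, 44, 50)
print(f"base CI {base_ci}, accelerated CI {fast_ci}, drop {drop:.3f}")
```

With only tens of rollouts per task the intervals are wide, so a small observed drop may not distinguish "no degradation" from a few points of lost success rate.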

Figures

Figures reproduced from arXiv: 2605.29438 by Chenglei Wu, Huanan Liu, Jiajun Fan, Kangye Ji, Shiyu Qin, Shu-Tao Xia, Ye Li, Yuan Meng, Yuansong Wang, Zhi Wang.

**Figure 1.** Observation motivating phase-adaptive cache scheduling. VLA execution exhibits temporally varying inference demand rather than uniform step-wise complexity. In this drawer-opening and apple-grasping rollout, (a,b) show pairwise CKA similarities of the first and final LLM-layer representations, (c) shows robot motion speeds, and (d) shows the rollout. Similar changes appear in both final-layer and first-lay…

**Figure 2.** Overview of ElegantVLA. ElegantVLA learns when to think during VLA control by attaching a phase-adaptive scheduler to a frozen base policy. At each step, the scheduler uses temporal representation similarity, robot-motion signals, and episode progress to select a joint compute action for the Vision-LLM backbone and the action head. It allocates full recomputation to phase-sensitive moments and reuses cach…

**Figure 3.** Qualitative visualization of faster real-world execution. On real-world pineapple-bun pickup, ElegantVLA preserves the pick-and-place phases while completing the task 2.31× faster than full-computation GR00T.

**Figure 4.** Qualitative visualization of faster simulated execution. On Google Robot drawer opening and WidowX carrot placement, ElegantVLA completes the rollouts faster while preserving the key manipulation phases. space: ElegantVLA can refresh high-level visual-language reasoning while reusing low-level action refinement, or reuse the Vision-LLM backbone while recomputing the action head when precise control is nee…

**Figure 5.** Real-world experimental platform. The physical experiments use a Franka Research 3 setup with stationary tabletop manipulation on the left and conveyor-belt pickup on the right.

**Figure 6.** Real-world rollout on Phone Stand. ElegantVLA preserves precise placement and release on a stationary task, indicating that acceleration does not degrade fine-grained execution.

**Figure 7.** Real-world rollout on Pen Holder. Adaptive reuse keeps the contact-sensitive insertion phase, providing a qualitative check that temporal reuse does not remove precision-critical behavior.

**Figure 8.** Real-world rollout on Stack Bowls. The sequence preserves grasping, alignment, and release, showing that acceleration does not collapse the manipulation into a coarse action sequence.

**Figure 9.** Real-world rollout on Toast. The conveyor-belt setting tests whether lower latency lets the policy keep correcting the end-effector motion around a moving target.

**Figure 10.** Real-world rollout on Chocolate Bun. This additional moving-target rollout shows the same responsiveness regime as the main pineapple-bun example, where timely corrections are important.

**Figure 11.** Figure 11 illustrates this two-level adaptation on a representative rollout. The remaining simulation rollouts in Figs. 12–23 show the same principle across Google Robot and WidowX Robot tasks. Across different task demands, ElegantVLA uses lightweight inference during easy phases while preserving computation for semantically or physically complex phases. This behavior is the key qualitative evidence that our accel…

**Figure 12.** Google Robot simulation rollout on Move Near. The spatial transport task shows how the scheduler changes lightweight modes as high-level task complexity changes.

**Figure 13.** Google Robot simulation rollout on Open Drawer. The drawer manipulation task illustrates that ElegantVLA adapts computation to semantic state changes rather than following one fixed reuse pattern.

**Figure 14.** Google Robot simulation rollout on Pick Coke Can. The object pickup task shows how high-level object state and low-level grasp control jointly affect the scheduler’s lightweight choices.

**Figure 15.** Google Robot simulation rollout on Pick Object. The generic pickup task provides another example where ElegantVLA adjusts computation as semantic and control complexity vary over time.

**Figure 16.** Google Robot simulation rollout on Close Drawer. Closing the drawer requires the scheduler to preserve computation around drawer-state changes while allowing reuse during stable motion.

**Figure 17.** WidowX Robot simulation rollout on Carrot on Plate. The placement task tests whether the two-level scheduler remains adaptive under the WidowX visual domain.

**Figure 18.** WidowX Robot simulation rollout on Close Drawer. The contact-rich drawer interaction shows how low-level control complexity affects action-head reuse.

**Figure 19.** WidowX Robot simulation rollout on Open Drawer. Opening the drawer provides another contact-sensitive example where the scheduler should reduce aggressive reuse around physical interaction.

**Figure 20.** WidowX Robot simulation rollout on Put Eggplant in Basket. The final placement changes the required object configuration, so the scheduler preserves computation during the more complex control phase.

**Figure 21.** WidowX Robot simulation rollout on Put Eggplant in Sink. This placement task further shows task-complexity-adaptive computation under changing final object configurations.

**Figure 22.** WidowX Robot simulation rollout on Spoon on Towel. The scheduler keeps lightweight inference for stable motion while preserving computation near placement and release.

**Figure 23.** WidowX Robot simulation rollout on Stack Cube. The stacking task provides an additional contact-sensitive case where low-level alignment and release require less aggressive action-head reuse.
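The caption of Figure 1 refers to pairwise CKA similarities between layer representations at different control steps. A minimal sketch of linear CKA (centered kernel alignment) in plain Python follows. It assumes each step yields a tokens-by-hidden matrix with the same number of rows at both steps; which CKA variant, which tokens, and which normalization the paper uses are not stated on this page, so treat this as the textbook linear form only.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def center_columns(x: Sequence[Sequence[float]]) -> Matrix:
    """Subtract each column's mean so every feature has zero mean over rows."""
    n, d = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in x]


def gram(x: Sequence[Sequence[float]]) -> Matrix:
    """Row-by-row inner products: G[i][k] = <x_i, x_k>."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally sized matrices."""
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def linear_cka(x: Sequence[Sequence[float]], y: Sequence[Sequence[float]]) -> float:
    """Linear CKA between two representations with the same number of rows."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    denom = math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))
    return frobenius_inner(k, l) / denom if denom > 0.0 else 0.0


def pairwise_cka(steps: Sequence[Sequence[Sequence[float]]]) -> Matrix:
    """Similarity matrix over a rollout: entry [s][t] compares step s with step t."""
    return [[linear_cka(a, b) for b in steps] for a in steps]
```

Linear CKA is invariant to isotropic scaling and orthogonal transforms of the features, which makes it a reasonable signal for "the representation has barely changed since the last step"; its cost grows quadratically with the number of rows, so a deployed scheduler would likely need a cheaper proxy.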

Original abstract

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
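The idea of multi-step temporal reuse in the abstract can be illustrated with a small cache that recomputes an expensive result only when it has been reused for a given number of steps. The class name, the `max_age` parameter, and the mapping from compute modes to reuse horizons are hypothetical; the page does not describe how ElegantVLA stores or refreshes cached states.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TemporalReuseCache(Generic[T]):
    """Holds the latest expensive result and reuses it for a bounded number of steps."""

    def __init__(self) -> None:
        self._value: Optional[T] = None
        self._has_value = False
        self._age = 0               # how many times the current value has been reused
        self.recomputations = 0     # bookkeeping for measuring saved work

    def step(self, compute: Callable[[], T], max_age: int) -> T:
        """Return a result for this control step.

        max_age = 0 forces recomputation every step (the "full" end of the range);
        max_age = k lets one computed result serve up to k further steps.
        """
        if max_age < 0:
            raise ValueError("max_age must be non-negative")
        if not self._has_value or self._age >= max_age:
            self._value = compute()
            self._has_value = True
            self._age = 0
            self.recomputations += 1
        else:
            self._age += 1
        return self._value  # type: ignore[return-value]


# Usage: a stand-in "backbone" that just returns the step index it was run at.
cache: TemporalReuseCache[int] = TemporalReuseCache()
outputs = [cache.step(lambda t=t: t, max_age=2) for t in range(6)]
print(outputs, cache.recomputations)
```

In a real pipeline the scheduler would pick `max_age` per step from its compute mode, so a goal-sensitive step could pass `max_age=0` and force a refresh regardless of how fresh the cache is.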

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ElegantVLA adds a coordinated scheduler for reusing compute in VLA perception and action stages during stable periods, with reported speedups, but the abstract supplies no success-rate or scheduler-accuracy numbers.

This paper's main point is a lightweight scheduler that picks compute levels for the vision-LLM (five options from full to reuse) and action denoising (three options) by watching temporal representation similarity, robot motion, and episode progress. It aims to reuse prior work when things are stable, like human control does.

The new element is this coordinated scheduling across the perception-language and action parts, packaged as a plug-in framework for existing VLA models such as GR00T and CogACT. The reported results include speedups of up to 2.55x on GR00T and 3.77x on CogACT, and, on six real-world tasks, a near-doubling of control frequency from 13.8 Hz to 26.3 Hz (about 1.9x) with 2.18x less computation.

The experiments give concrete numbers on efficiency gains, which is the practical strength here.

The soft spot is the missing information on task success rates and scheduler reliability. The abstract does not mention any deltas in success or how often the scheduler might skip needed computation. The concern that the cues might not catch goal-sensitive stages accurately is worth checking, because that would affect whether the speedups come for free. Details on scheduler training are also absent from the summary.

This is for roboticists working on efficient VLA deployment. It has a clear idea and measurable gains, so it deserves peer review to examine the full methods and results, particularly around performance preservation.

Referee Report

3 major / 2 minor

Summary. The paper proposes ElegantVLA, a plug-in phase-adaptive inference framework for Vision-Language-Action (VLA) models. It introduces a lightweight scheduler that uses temporal representation similarity, robot-motion cues, and episode progress to dynamically select among five-level Vision-LLM compute modes (full recomputation to multi-step temporal reuse) and three-level action denoising modes, without modifying or retraining the base model. The central claims are speedups of up to 2.55x on GR00T and 3.77x on CogACT, plus 2.18x compute reduction and control frequency increase from 13.8 Hz to 26.3 Hz on six real-world GR00T tasks.

Significance. If the scheduler maintains task success rates while delivering the reported efficiency gains, this could meaningfully advance real-time robotic control by addressing the uniform high compute of VLAs. The plug-in design, coordination of perception and action scheduling, and validation across simulation models plus real-world tasks are concrete strengths that would support broader adoption if the no-degradation claim is substantiated.

major comments (3)

[§4] Experiments on real-world tasks: success rates or task-completion metrics comparing ElegantVLA to the GR00T baseline on the six real-world tasks are not reported; only aggregate compute reduction and control frequency are given. This bears directly on the central claim that the scheduler preserves performance without base-model changes.
[§3.1] Lightweight scheduler: the exact similarity metric, decision thresholds, and combination rule for temporal representation similarity, robot-motion cues, and episode progress are not specified. Without these, the accuracy of recomputation detection cannot be assessed or reproduced, which is load-bearing for the plug-in acceleration claim.
[§3.2] Vision-LLM and denoising modes: the implementation details of the five-level Vision-LLM reuse and three-level denoising reuse (e.g., how intermediate states are cached and reused) are insufficient to verify that output quality is preserved in goal-sensitive stages, undermining the assertion that no base-model retraining is needed.

minor comments (2)

[Abstract] The reported speedups are given as 'up to' values without accompanying averages, variance, or the conditions under which they were measured.
[§3] Notation: the five-level and three-level modes would benefit from an explicit table mapping cue conditions to selected modes.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below and will make the indicated revisions to improve the manuscript's clarity, reproducibility, and substantiation of claims.

Referee: [§4] Experiments on real-world tasks: success rates or task-completion metrics comparing ElegantVLA to the GR00T baseline on the six real-world tasks are not reported; only aggregate compute reduction and control frequency are given. This bears directly on the central claim that the scheduler preserves performance without base-model changes.

Authors: The referee correctly notes that §4 reports only the aggregate 2.18× compute reduction and frequency increase (13.8 Hz to 26.3 Hz) for the six real-world GR00T tasks. Although the scheduler is explicitly designed to allocate full computation to goal-sensitive stages, and simulation results on GR00T and CogACT show no degradation, we agree that real-world success rates are required to fully support the performance-preservation claim. In the revised manuscript we will add these metrics (task success rates and completion percentages versus the GR00T baseline) to §4, using data from the same experimental runs.

Revision: yes

Referee: [§3.1] Lightweight scheduler: the exact similarity metric, decision thresholds, and combination rule for temporal representation similarity, robot-motion cues, and episode progress are not specified. Without these, the accuracy of recomputation detection cannot be assessed or reproduced, which is load-bearing for the plug-in acceleration claim.

Authors: We accept this criticism. The current text describes the three input signals but omits the precise implementation. In the revision we will specify: (i) cosine similarity on the concatenated vision-language features, (ii) concrete thresholds (e.g., similarity > 0.92 triggers partial reuse), and (iii) the combination rule (a weighted sum with coefficients 0.5/0.3/0.2 followed by a threshold comparison). Pseudocode and the validation procedure used to set the thresholds will also be added to §3.1.

Revision: yes

Referee: [§3.2] Vision-LLM and denoising modes: the implementation details of the five-level Vision-LLM reuse and three-level denoising reuse (e.g., how intermediate states are cached and reused) are insufficient to verify that output quality is preserved in goal-sensitive stages, undermining the assertion that no base-model retraining is needed.

Authors: We agree that the mode descriptions are currently high-level. The revised §3.2 will include: (i) explicit definitions of the five Vision-LLM levels (full forward pass down to k-step KV-cache reuse), (ii) the three denoising levels with details on which intermediate noise estimates are cached and how they are injected, and (iii) a diagram showing the cache-update logic that ensures full refinement occurs precisely when the scheduler detects goal-sensitive phases. These additions will make clear that all operations remain inference-only and require no base-model changes or retraining.

Revision: yes
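The combination rule named in the simulated rebuttal (cosine similarity, a weighted sum with coefficients 0.5/0.3/0.2, then a threshold comparison) can be sketched as follows. Several points are assumptions rather than anything stated on this page: that the coefficients apply to similarity, motion, and progress in that order; that the motion and progress signals are already scaled to [0, 1] with higher meaning more stable; and that 0.92 is reused as the threshold on the combined score. The rebuttal is itself machine-simulated, and Figure 1 refers to CKA rather than cosine similarity, so none of this should be read as the paper's actual rule.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def stability_score(
    similarity: float,
    motion_stability: float,
    progress_stability: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # order is an ASSUMPTION
) -> float:
    """Weighted sum of the three signals, each assumed to lie in [0, 1]."""
    w_sim, w_motion, w_progress = weights
    return (w_sim * similarity
            + w_motion * motion_stability
            + w_progress * progress_stability)


def allow_reuse(score: float, threshold: float = 0.92) -> bool:
    """Threshold comparison; 0.92 is borrowed from the rebuttal's similarity example."""
    return score > threshold


# Usage with made-up feature vectors from two consecutive steps.
prev_features = [0.20, 0.90, 0.40]
curr_features = [0.21, 0.88, 0.41]
sim = cosine_similarity(prev_features, curr_features)
score = stability_score(sim, motion_stability=0.95, progress_stability=0.90)
print(f"similarity={sim:.4f} score={score:.4f} reuse={allow_reuse(score)}")
```

A single scalar threshold can only separate "reuse" from "recompute"; choosing among five Vision-LLM levels and three denoising levels would need several thresholds or a learned policy, which this sketch does not attempt.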

Circularity Check

0 steps flagged

No circularity: claims are empirical measurements of a plug-in scheduler, not derivations reducing to inputs

full rationale

The paper introduces ElegantVLA as a plug-in inference framework with a lightweight scheduler using temporal similarity, motion cues, and progress to select compute modes. All central claims (2.55x/3.77x speedups, 2.18x real-world compute cut, frequency increase from 13.8 Hz to 26.3 Hz) are presented as direct experimental outcomes on GR00T and CogACT without any equations, fitted parameters renamed as predictions, or self-citation chains that define the result by construction. The scheduler's design is described descriptively and validated by reported measurements rather than tautological internal definitions. No load-bearing steps reduce to self-definition or fitted-input predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, background axioms, or invented entities beyond the scheduler itself; the scheduler is introduced as part of the contribution rather than derived from prior literature.

invented entities (1)

lightweight scheduler (no independent evidence)
purpose: jointly allocate computation across vision encoder, LLM, and action head
Introduced as the core new component that observes temporal similarity, motion cues, and episode progress.

pith-pipeline@v0.9.1-grok · 5875 in / 1164 out tokens · 31700 ms · 2026-06-29T07:14:06.744077+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 25 canonical work pages · 12 internal anchors

[1]

Rt-2: Vision-language-action models trans- fer web knowledge to robotic control

Brianna Zitkovich, Tianhe Y u, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models trans- fer web knowledge to robotic control. In Conference on Robot Learning , pages 2165–2183. PMLR, 2023

2023
[2]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action ﬂow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Y aobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Y u Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Weifan Guan, Qinghao Hu, Aosheng Li, and Jian Cheng. Efﬁcient vision-language-action models for embodied manipulation: A systematic survey. arXiv preprint arXiv:2510.17111 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

arXiv preprint arXiv:2510.24795 , year =

Zhaoshu Y u, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, and Heng Tao Shen. A survey on efﬁcient vision-language-action models. arXiv preprint arXiv:2510.24795, 2025

work page arXiv 2025
[9]

Deer-vla: Dynamic inference of multimodal large language models for efﬁcient robot execution

Y ang Y ue, Y ulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, and Gao Huang. Deer-vla: Dynamic inference of multimodal large language models for efﬁcient robot execution. Advances in Neural Information Processing Systems , 37:56619–56643, 2024

2024
[10]

Quantization-aware imitation-learning for resource-efﬁcient robotic control

Seongmin Park, Hyungmin Kim, Wonseok Jeon, Juyoung Y ang, Byeongwook Jeon, Y oonseon Oh, and Jungwook Choi. Quantization-aware imitation-learning for resource-efﬁcient robotic control. arXiv preprint arXiv:2412.01034, 2024

work page arXiv 2024
[11]

arXiv preprint arXiv:2506.10100 (2025)

Y antai Y ang, Y uhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, and Linfeng Zhang. Efﬁcientvla: Training-free acceleration and compression for vision- language-action models. arXiv preprint arXiv:2506.10100, 2025

work page arXiv 2025
[12]

Dysl-vla: Efﬁcient vision-language-action model inference via dynamic-static layer-skipping for robot manipula- tion

Zebin Y ang, Yijiahao Qi, Tong Xie, Bo Y u, Shaoshan Liu, and Meng Li. Dysl-vla: Efﬁcient vision-language-action model inference via dynamic-static layer-skipping for robot manipula- tion. arXiv preprint arXiv:2602.22896, 2026

work page arXiv 2026
[13]

Dyq-vla: Temporal-dynamic-aware quantization for embodied vision-language-action models

Zihao Zheng, Hangyu Cao, Sicheng Tian, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, et al. Dyq-vla: Temporal-dynamic-aware quantization for embodied vision-language-action models. arXiv preprint arXiv:2603.07904 , 2026

work page arXiv 2026
[14]

Acˆ 2-vla: Action-context- aware adaptive computation in vision-language-action models for efﬁcient robotic manipula- tion

Wenda Y u, Tianshi Wang, Fengling Li, Jingjing Li, and Lei Zhu. Acˆ 2-vla: Action-context- aware adaptive computation in vision-language-action models for efﬁcient robotic manipula- tion. arXiv preprint arXiv:2601.19634, 2026

work page arXiv 2026
[15]

Prance: Joint token-optimization and structural channel-pruning for adaptive vit inference

Y e Li, Chen Tang, Y uan Meng, Jiajun Fan, Zenghao Chai, Xinzhu Ma, Zhi Wang, and Wenwu Zhu. Prance: Joint token-optimization and structural channel-pruning for adaptive vit inference. IEEE Transactions on Pattern Analysis and Machine Intelligence , 2025. 10

2025
[16]

Vla-cache: Towards efﬁcient vision-language-action model via adaptive token caching in robotic manipu- lation

Siyu Xu, Y unke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, and Chang Xu. Vla-cache: Towards efﬁcient vision-language-action model via adaptive token caching in robotic manipu- lation. arXiv e-prints, pages arXiv–2502, 2025

2025
[17]

Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration

Y e Li, Y uan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, and Wenwu Zhu. Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723, 2025

work page arXiv 2025
[18]

arXiv preprint arXiv:2505.21200 , year =

Xudong Tan, Y aoxin Y ang, Peng Y e, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, and Tao Chen. Think twice, act once: Token-aware compression and action reuse for efﬁcient inference in vision-language-action models. arXiv preprint arXiv:2505.21200, 2025

work page arXiv 2025
[19]

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Hanzhen Wang, Jiaming Xu, Y ushun Xiang, Jiayi Pan, Y ongkang Zhou, Y ong-Lu Li, and Guohao Dai. Specprune-vla: Accelerating vision-language-action models via action-aware self-speculative pruning. arXiv preprint arXiv:2509.05614, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Action-aware dynamic pruning for efficient vision-language-action manipulation.arXiv preprint arXiv:2509.22093, 2025

Xiaohuan Pei, Y uxing Chen, Siyu Xu, Y unke Wang, Y uheng Shi, and Chang Xu. Action- aware dynamic pruning for efﬁcient vision-language-action manipulation. arXiv preprint arXiv:2509.22093, 2025

work page arXiv 2025
[21]

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Ziyan Liu, Y eqiu Chen, Hongyi Cai, Tao Lin, Shuo Y ang, Zheng Liu, and Bo Zhao. Vla-pruner: Temporal-aware dual-level visual token pruning for efﬁcient vision-language-action inference. arXiv preprint arXiv:2511.16449, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

arXiv preprint arXiv:2509.12594 , year =

Titong Jiang, Xuefeng Jiang, Y uan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Y ahui Liu, Sheng Sun, and Xianpeng Lang. The better you learn, the smarter you prune: To- wards efﬁcient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594, 2025

work page arXiv 2025
[23]

Ts-dp: Reinforcement speculative decoding for temporal adaptive diffusion policy acceleration

Y e Li, Jiahe Feng, Y uan Meng, Kangye Ji, Chen Tang, Xinwan Wen, Shutao Xia, Zhi Wang, and Wenwu Zhu. Ts-dp: Reinforcement speculative decoding for temporal adaptive diffusion policy acceleration. arXiv preprint arXiv:2512.15773, 2025

work page arXiv 2025
[24]

Block-wise Adaptive Caching for Accelerating Diffusion Policy

Kangye Ji, Y uan Meng, Hanyun Cui, Y e Li, Shengjia Hua, Lei Chen, and Zhi Wang. Block- wise adaptive caching for accelerating diffusion policy. arXiv preprint arXiv:2506.13456 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Accelerating vision-language-action model integrated with action chunking via parallel decoding

Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action model integrated with action chunking via parallel decoding. arXiv preprint arXiv:2503.02310, 2025

work page arXiv 2025
[26]

FASTER: Rethinking Real-Time Flow VLAs

Y uxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Y ang, Jinghua Hou, Junyi Li, Kaixin Ding, and Hengshuang Zhao. Faster: Rethinking real-time ﬂow vlas. arXiv preprint arXiv:2603.19199 , 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

An internal model for sensori- motor integration

Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensori- motor integration. Science, 269(5232):1880–1882, 1995

1995
[28]

Signal-dependent noise determines motor plan- ning

Christopher M Harris and Daniel M Wolpert. Signal-dependent noise determines motor plan- ning. Nature, 394(6695):780–784, 1998

1998
[29]

Optimal feedback control as a theory of motor coor- dination

Emanuel Todorov and Michael I Jordan. Optimal feedback control as a theory of motor coor- dination. Nature neuroscience, 5(11):1226–1235, 2002

2002
[30]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

[31]

A ddpg-based solution for optimal consensus of continuous-time linear multi-agent systems

Ye Li, ZhongXin Liu, Ge Lan, Malika Sader, and ZengQiang Chen. A ddpg-based solution for optimal consensus of continuous-time linear multi-agent systems. Science China Technological Sciences, 66(8):2441–2453, 2023

[32]

A novel data-driven model-free synchronization protocol for discrete-time multi-agent systems via td3 based algorithm

Zhongxin Liu, Ye Li, Ge Lan, and Zengqiang Chen. A novel data-driven model-free synchronization protocol for discrete-time multi-agent systems via td3 based algorithm. Knowledge-Based Systems, 287:111430, 2024

[33]

Ttf-vla: Temporal token fusion via pixel-attention integration for vision-language-action models

Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, and Huiling Duan. Ttf-vla: Temporal token fusion via pixel-attention integration for vision-language-action models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18452–18459, 2026

[34]

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

Kangye Ji, Yuan Meng, Zhou Jianbo, Ye Li, Hanyun Cui, and Zhi Wang. Sparse actiongen: Accelerating diffusion policy with real-time pruning. arXiv preprint arXiv:2601.12894, 2026

[35]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024

[36]

Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Dan Wang, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025

Limitations and Responsible Use
This work is a technical study of infe...

