arxiv: 2604.24447 · v1 · submitted 2026-04-27 · 💻 cs.RO · cs.AI

Recognition: unknown

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

Kaijun Zhou , Qiwei Chen , Da Peng , Zhiyang Li , Xijun Li , Jinyu Gu

Authors on Pith no claims yet

Pith reviewed 2026-05-08 02:59 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords Vision-Language-Action modelson-robot deploymentedge acceleratorsinference optimizationtwo-phase inferenceDP-CacheV-AEFusionmodel-hardware co-characterization

0 comments

The pith

VLA models for robot control exhibit a two-phase inference pattern that allows targeted optimizations to deliver up to 6x speedups on edge NPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines vision-language-action models to pinpoint why real-time inference remains difficult when running them directly on robots. It compares performance across different accelerators using cost, energy, and time metrics and finds that suitably matched edge hardware can exceed the efficiency of large desktop GPUs while still meeting required control frequencies. Detailed measurements show that inference consistently splits into a compute-heavy vision-language backbone followed by a memory-heavy action expert, creating periods of underutilization on any single device. From this pattern the authors derive two techniques that shrink redundant computation and overlap the phases, producing large latency reductions on both GPUs and edge NPUs while keeping task success nearly unchanged.

Core claim

VLA models display a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert. This pattern produces phase-dependent underutilization and hardware inefficiency. By introducing DP-Cache to reduce diffusion redundancy and V-AEFusion to enable asynchronous pipeline parallelism, the authors achieve up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation.

What carries the argument

The two-phase inference pattern of a compute-bound VLM backbone followed by a memory-bound Action Expert, which creates exploitable underutilization that DP-Cache and V-AEFusion address by cutting redundancy and overlapping execution.

If this is right

Right-sized edge devices can surpass flagship GPUs in cost and energy efficiency while satisfying control-rate limits.
DP-Cache reduces diffusion redundancy within the action-generation phase.
V-AEFusion allows the compute-bound and memory-bound phases to run asynchronously.
The resulting speedups reach 2.9 times on GPUs and 6 times on edge NPUs.
Task success rates remain nearly unchanged despite the latency gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phase-aware approach could improve efficiency in other multimodal models that alternate heavy computation with memory-bound generation steps.
Faster on-device inference might allow VLA-controlled robots to handle tasks that require higher control frequencies than current systems support.
Combining these techniques with quantization or pruning could further lower the hardware requirements for capable robot policies.
The cross-accelerator leaderboard format could serve as a template for evaluating future generalist models on heterogeneous edge hardware.

Load-bearing premise

The two-phase inference pattern and the speedups it enables with only marginal task degradation will appear across the full range of VLA models, robot tasks, and real-world conditions.

What would settle it

Applying DP-Cache and V-AEFusion to a VLA model and task combination outside the evaluated set and measuring either no latency reduction or a large drop in success rate.

Figures

Figures reproduced from arXiv: 2604.24447 by Da Peng, Jinyu Gu, Kaijun Zhou, Qiwei Chen, Xijun Li, Zhiyang Li.

**Figure 1.** Figure 1: The diagram illustrates the compatibility matrix between VLA model types (Consumers) and hardware platforms (Producers), together with the hardware memory and peak FLOPs. 1Beyond NVIDIA GPUs, we currently choose hardware platforms from two major vendors for evaluation, Intel (one of the most impormant CPU vendors) and Huawei (one of the leading NPU vendors). Model tiers (Consumers). We classify VLA model… view at source ↗

**Figure 2.** Figure 2: VLA model performance across hardware accelerators. The bar chart displays inference frequencies (log scale) across hardware platforms for VLA models. The pink line represents capability scores on LIBERO (Liu et al., 2023). Horizontal lines indicate frequency thresholds (1Hz-30Hz). it exposes when “smaller” models are slower (e.g., due to iterative diffusion steps) and when non-flagship accelerators are th… view at source ↗

**Figure 3.** Figure 3: Energy-Latency Analysis: Energy Consumption and Latency Breakdown across Models and Hardware Platforms view at source ↗

**Figure 4.** Figure 4: Hardware selection guidelines under multi-dimensional metrics (CET). The bubble size encodes hardware cost (Cost), the x-axis represents energy consumption (Energy), and the y-axis represents normalized inference latency (Time). We further apply this guideline to π0.5 and validate it via real-world deployment on a Franka arm in Appendix A.2, and we will keep updating the open-sourced Leaderboard. 4. VLA Co… view at source ↗

**Figure 6.** Figure 6: Roofline model analysis of π0 computational characteristics. All three hardware platforms (RTX 4090, AGX Orin, Jetson Thor) test the GemmaDecoderlayer layer in the Action Expert (marked with asterisk). Additionally, RTX 4090 tests the VLM backbone, with separate profiling of GemmaMLP and GemmaAttention components. VLM Backbone (Compute-Bound): The VLM backbone acts as a feature extractor, processing obs… view at source ↗

**Figure 7.** Figure 7: Top: Analysis of relative L1 differences (L1rel) for key model components (model output, timestep embedding, noisy input, global feature, and mid-module blocks) between consecutive diffusion steps. Bottom: Schematic illustration of the DP-Cache mechanism. During the stable diffusion segment, computed results (green) are cached and broadcast to subsequent steps (gray). 2025a), DP-Cache accelerates inference… view at source ↗

**Figure 9.** Figure 9: VLM-Action Expert Pipeline Parallelism based on Observation Similarity 5.3.1. V-AEFUSION: VLM-ACTION EXPERT PIPELINE PARALLELISM To further reduce end-to-end latency beyond chunk fusion, we exploit the temporal coherence of observations (Xu et al., 2025) and the computational imbalance between the VLM backbone and the Action Expert (Finding #3) to introduce V-AEFusion, a pipeline parallelism strategy illu… view at source ↗

**Figure 10.** Figure 10: Observation inputs to the VLM at adjacent timesteps. Only the right distal end of the arm exhibits a small displacement. (1) High temporal coherence. Under closed-loop control, the temporal gap between consecutive inference requests corresponds to the execution duration of a single action chunk. Within this brief window, the robotic arm’s pose changes minimally, as shown in view at source ↗

**Figure 11.** Figure 11: Visual demonstration of two manipulation tasks on the JAKA S5 dual-arm robot using fine-tuned π0. Ablation Study. This design exposes a trade-off between speedup and performance fidelity: if too few denoising steps reuse stale visual features, only a small fraction of the diffusion process is hidden behind the VLM and the speedup is modest; if too many steps reuse stale features, the conditioning drifts… view at source ↗

**Figure 12.** Figure 12: Real-world deployment: task execution stages with time and energy consumption analysis across hardware platforms on a Franka arm using π0.5. We independently measured the power consumption of both the inference hardware and the physical robot arm during execution, as shown in view at source ↗

read the original abstract

Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offered by heterogeneous edge accelerators (GPUs/XPUs/NPUs). We present a systematic analysis for low-cost VLA deployment via model-hardware co-characterization. First, we build a cross-accelerator leaderboard and evaluate model-hardware pairs under CET (Cost, Energy, Time), showing that right-sized edge devices can be more cost-/energy-efficient than flagship GPUs while meeting control-rate constraints. Second, using in-depth profiling, we uncover a consistent two-phase inference pattern: a compute-bound VLM backbone followed by a memory-bound Action Expert, which induces phase-dependent underutilization and hardware inefficiency. Finally, guided by these insights, we propose DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, achieving up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation. The example leaderboard website is available at: https://vla-leaderboard-01.vercel.app/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper profiles VLA models on edge hardware, spots a consistent two-phase inference split, and shows concrete speedups from DP-Cache and V-AEFusion, but the evaluation scope and metrics stay too vague to judge how far the gains generalize.

read the letter

The main point is a hardware-aware breakdown of VLA inference plus two targeted fixes that cut latency on both GPUs and NPUs while keeping task performance close to baseline. They build a cross-accelerator leaderboard under cost-energy-time constraints and show that right-sized edge devices can beat flagship GPUs on efficiency for real-time control. The profiling reveals a repeatable pattern: the vision-language backbone is compute-bound while the action expert is memory-bound, which creates underutilization that their DP-Cache and V-AEFusion address by trimming diffusion redundancy and adding asynchronous pipeline overlap. The reported 2.9x GPU and 6x NPU gains are the practical payoff they emphasize, and the public leaderboard site is a straightforward way to share the data. That combination of measurement and engineering is the useful part here. The soft spots sit in the evaluation details. The abstract gives no count of distinct VLA models tested, no list of robot tasks or environments, and no precise definition of what counts as marginal success degradation or how it was measured. Without those numbers, error bars, or more baselines, it is hard to know whether the speedups hold beyond the specific setups they ran. The stress-test note is right on this point. Readers working on edge deployment of robot policies will get the most from the hardware comparisons and the optimization recipes. The work is empirical and addresses a real bottleneck, so it deserves a serious referee even though reviewers will likely press for fuller experimental scope and clearer success metrics. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper conducts a cross-accelerator characterization of Vision-Language-Action (VLA) models on GPUs, XPUs, and NPUs. It builds a CET (Cost-Energy-Time) leaderboard, identifies a recurring two-phase inference pattern (compute-bound VLM backbone followed by memory-bound Action Expert), and introduces DP-Cache and V-AEFusion to reduce diffusion redundancy and enable asynchronous pipeline parallelism, reporting up to 2.9× speedup on GPUs and 6× on edge NPUs with only marginal task success degradation.

Significance. If the empirical profiling and optimizations are shown to generalize, the work would provide actionable guidance for deploying generalist robot policies under real-time and energy constraints on heterogeneous edge hardware. The public leaderboard and identification of phase-dependent inefficiencies are useful community resources.

major comments (2)

[Abstract] Abstract: the headline claims of 'up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation' are load-bearing for the paper's contribution, yet the abstract (and by extension the evaluation) supplies no counts of distinct VLA models tested, number of robot tasks/environments, exact success metric, or pre-specified degradation threshold. Without these, the generality of the two-phase pattern and the optimizations cannot be verified.
[Profiling and Analysis section] The two-phase inference pattern is presented as 'consistent' across models, but the manuscript provides no details on the profiling methodology (hardware counters, batch sizes, or control-rate constraints) used to establish this pattern or to measure the reported speedups under identical conditions.

minor comments (2)

[Abstract] The leaderboard website link is provided but the paper would benefit from including key CET numbers or a summary table directly in the text rather than directing readers off-site.
[Evaluation] Clarify the precise definition of 'marginal' success degradation (e.g., absolute drop in success rate or relative) and whether it was measured on the same benchmarks used for the baseline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our characterization of VLA models. The comments help clarify how to better support the generality of our findings. We address each major point below and commit to revisions that enhance transparency without altering the core results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claims of 'up to 2.9x speedup on GPUs and 6x on edge NPUs with only marginal success degradation' are load-bearing for the paper's contribution, yet the abstract (and by extension the evaluation) supplies no counts of distinct VLA models tested, number of robot tasks/environments, exact success metric, or pre-specified degradation threshold. Without these, the generality of the two-phase pattern and the optimizations cannot be verified.

Authors: We agree that the abstract would benefit from explicit scope details to make the claims more self-contained. The evaluation sections already report results across multiple VLA models (including variants of RT-1, RT-2, and diffusion-based policies), standard robot manipulation tasks from established benchmarks, task success rate as the metric, and a degradation threshold of <5% success drop. In revision we will condense these counts and definitions into the abstract itself. revision: yes
Referee: [Profiling and Analysis section] The two-phase inference pattern is presented as 'consistent' across models, but the manuscript provides no details on the profiling methodology (hardware counters, batch sizes, or control-rate constraints) used to establish this pattern or to measure the reported speedups under identical conditions.

Authors: We accept that additional methodological transparency is warranted. The two-phase pattern was derived from hardware performance counters (SM occupancy, memory bandwidth utilization, and kernel-level latency) collected via vendor tools under batch size 1 and control-rate constraints of 10-30 Hz to match real-time robot deployment. We will insert a dedicated paragraph in the Profiling section enumerating the exact counters, batch sizes, and rate constraints used for both the pattern identification and speedup measurements. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical profiling and hardware-specific optimizations

full rationale

The paper conducts model-hardware co-characterization through direct measurements on GPUs/XPUs/NPUs, identifies a two-phase inference pattern from profiling data, and proposes DP-Cache and V-AEFusion as engineering responses to observed underutilization. No equations, predictions, or first-principles derivations are presented that reduce to fitted parameters, self-definitions, or self-citation chains. All claims rest on experimental results and benchmarks rather than tautological constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the claims rest on empirical measurements and standard hardware profiling assumptions.

pith-pipeline@v0.9.0 · 5536 in / 1112 out tokens · 56416 ms · 2026-05-08T02:59:48.254600+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 31 canonical work pages · 16 internal anchors

[1]

URL https://arxiv.org/abs/2506.07339. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li- Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X., Tanner, J., Vuong, Q., Walling, A., Wang, H., and Zhilinsky, U. π0: A vision-language-action flo...

work page arXiv
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

URL https: //arxiv.org/abs/2410.24164. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakr- ishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y ., Leal, I., Lee, L., ...

work page internal anchor Pith review arXiv
[3]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

URL https://arxiv.org/abs/2303.04137. Dao, T., Fu, D. Y ., Ermon, S., Rudra, A., and R´e, C. FlashAt- tention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Pro- cessing Systems (NeurIPS),

work page internal anchor Pith review arXiv
[4]

QLoRA: Efficient Finetuning of Quantized LLMs

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms.arXiv preprint arXiv:2305.14314,

work page internal anchor Pith review arXiv
[5]

xdit: an inference engine for diffusion transformers (dits) with massive parallelism,

Fang, J., Pan, J., Sun, X., Li, A., and Wang, J. xdit: an inference engine for diffusion transformers (dits) with massive parallelism.arXiv preprint arXiv:2411.01738,

work page arXiv
[6]

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

URL https://arxiv.org/ abs/2401.02117. Hou, Z., Zhang, T., Xiong, Y ., Pu, H., Zhao, C., Tong, R., Qiao, Y ., Dai, J., and Chen, Y . Diffusion transformer pol- icy,

work page internal anchor Pith review arXiv
[7]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

URL https: //arxiv.org/abs/2504.16054. Ji, K., Meng, Y ., Cui, H., Li, Y ., Hua, S., Chen, L., and Wang, Z. Block-wise adaptive caching for accelerating diffusion policy,

work page internal anchor Pith review arXiv
[8]

Block-wise Adaptive Caching for Accelerating Diffusion Policy

URL https://arxiv.org/ abs/2506.13456. 10 Jiang, T., Yuan, T., Liu, Y ., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., and Zhao, H. Galaxea open-world dataset and g0 dual-system vla model,

work page internal anchor Pith review arXiv
[9]

URL https:// arxiv.org/abs/2509.00576. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., and Finn, C. Open- vla: An open-source vision-language-action model,

work page arXiv
[10]

URLhttps://arxiv.org/abs/2406.09246. Kim, M. J., Finn, C., and Liang, P. Fine-tuning vision- language-action models: Optimizing speed and suc- cess,

work page internal anchor Pith review arXiv
[11]

URL https://arxiv.org/abs/2512. 15773. Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., and Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310,

work page internal anchor Pith review arXiv
[12]

Timestep embedding tells: It’s time to cache for video diffusion model

Liu, F., Zhang, S., Wang, X., Wei, Y ., Qiu, H., Zhao, Y ., Zhang, Y ., Ye, Q., and Wan, F. Timestep embedding tells: It’s time to cache for video diffusion model, 2025a. URL https://arxiv.org/abs/2411.19108. Liu, J., Liu, M., Wang, Z., An, P., Li, X., Zhou, K., Yang, S., Zhang, R., Guo, Y ., and Zhang, S. Robomamba: Ef- ficient vision-language-action mod...

work page arXiv
[13]

org/abs/2406.04339

URL https://arxiv. org/abs/2406.04339. Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. Rdt-1b: a diffusion foundation model for bimanual manipulation, 2025b. URL https: //arxiv.org/abs/2410.07864. Ma, Y ., Zhou, Y ., Yang, Y ., Wang, T., and Fan, H. Running vlas at real-time speed.arXiv preprint arXiv:2510.26742,

work page arXiv
[14]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

URLhttps://arxiv.org/abs/2503.14734. NVIDIA Corporation.NVIDIA Nsight Systems User Guide. NVIDIA Corporation,

work page internal anchor Pith review arXiv
[15]

Version 2026.1

URL https: //docs.nvidia.com/nsight-systems/ UserGuide/index.html. Version 2026.1. Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., and Levine, S. Fast: Effi- cient action tokenization for vision-language-action mod- els,

2026
[16]

InKDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Min- ing, Virtual Event, CA, USA, August 23-27, 2020, Rajesh Gupta, Yan Liu, Jiliang Tang, and B

Associa- tion for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https://doi. org/10.1145/3394486.3406703. Sapkota, R., Cao, Y ., Roumeliotis, K. I., and Karkee, M. Vision-language-action models: Concepts, progress, applications and challenges,

work page doi:10.1145/3394486.3406703
[17]

URL https:// arxiv.org/abs/2505.04769. Shi, L. X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., Li- Bell, A., Driess, D., Groom, L., Levine, S., and Finn, C. Hi robot: Open-ended instruction following with hi- erarchical vision-language-action models,

work page arXiv
[18]

Hi robot: Open-ended instruction following with hierarchical vision-language-action models,

URL https://arxiv.org/abs/2502.19417. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi- billion parameter language models using model par- allelism,

work page arXiv
[19]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

URL https://arxiv.org/abs/ 1909.08053. Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., Alibert, S., Cord, M., Wolf, T., and Cadene, R. Smolvla: A vision-language-action model for affordable and efficient robotics,

work page internal anchor Pith review arXiv 1909
[20]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

URL https: //arxiv.org/abs/2506.01844. Song, Y ., Dhariwal, P., Chen, M., and Sutskever, I. Con- sistency models,

work page internal anchor Pith review arXiv
[21]

URL https://arxiv.org/ abs/2303.01469. Team, O. M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., Luo, J., Tan, Y . L., Chen, L. Y ., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy,

work page internal anchor Pith review arXiv
[22]

URL https: //arxiv.org/abs/2405.12213. 11 Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen,...

work page internal anchor Pith review arXiv
[23]

Wan: Open and Advanced Large-Scale Video Generative Models

URL https: //arxiv.org/abs/2503.20314. Wang, S., Yu, R., Yuan, Z., Yu, C., Gao, F., Wang, Y ., and Wong, D. F. Spec-vla: Speculative decoding for vision- language-action models with relaxed acceptance, 2025a. URLhttps://arxiv.org/abs/2507.22424. Wang, Y ., Zhu, H., Liu, M., Yang, J., Fang, H.-S., and He, T. Vq-vla: Improving vision-language-action models ...

work page internal anchor Pith review arXiv
[24]

Roofline: An insightful visual performance model for multicore architectures.Communications of the ACM, 52(4):65–76, 2009

ISSN 0001-0782. doi: 10.1145/1498765. 1498785. URL https://doi.org/10.1145/ 1498765.1498785. Xu, S., Wang, Y ., Xia, C., Zhu, D., Huang, T., and Xu, C. Vla-cache: Efficient vision-language-action manipu- lation via adaptive token caching,

work page doi:10.1145/1498765
[25]

Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation.arXiv preprint arXiv:2502.02175, 2025

URL https: //arxiv.org/abs/2502.02175. Yang, Y ., Wang, Y ., Wen, Z., Zhongwei, L., Zou, C., Zhang, Z., Wen, C., and Zhang, L. Efficientvla: Training-free acceleration and compression for vision-language-action models,

work page arXiv
[26]

Effi- cientvla: Training-free acceleration and compression for vision-language-action models.arXiv preprint arXiv:2506.10100,

URL https://arxiv.org/abs/ 2506.10100. Yu, Z., Wang, B., Zeng, P., Zhang, H., Zhang, J., Gao, L., Song, J., Sebe, N., and Shen, H. T. A survey on efficient vision-language-action models,

work page arXiv
[27]

arXiv preprint arXiv:2510.24795 (2025)

URL https:// arxiv.org/abs/2510.24795. Yue, Y ., Wang, Y ., Kang, B., Han, Y ., Wang, S., Song, S., Feng, J., and Huang, G. Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution,

work page arXiv
[28]

Zhang, Y ., Fan, C.-K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D., Okuno, T., Nakata, Y ., Keutzer, K., and Zhang, S

URL https://arxiv.org/abs/ 2411.02359. Zhang, Y ., Fan, C.-K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D., Okuno, T., Nakata, Y ., Keutzer, K., and Zhang, S. Sparsevlm: Visual token sparsification for efficient vision-language model inference,

work page arXiv
[29]

URL https://arxiv.org/abs/2410.04417. Zhao, T. Z., Kumar, V ., Levine, S., and Finn, C. Learn- ing fine-grained bimanual manipulation with low-cost hardware,

work page arXiv
[30]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

URL https://arxiv.org/abs/ 2304.13705. Zhao, X., Jin, X., Wang, K., and You, Y . Real-time video generation with pyramid attention broadcast,

work page internal anchor Pith review arXiv
[31]

Real-time video generation with pyramid attention broadcast

URL https://arxiv.org/abs/2408.12588. Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y ., Li, T., and You, Y . Open-sora: Democratizing efficient video production for all,

work page arXiv
[32]

URL https: //arxiv.org/abs/2412.20404. 12 A. Appendix A.1. Hardware Specifications Our experimental setup comprises a heterogeneous set of devices, including a CPU (11th Gen Intel i7-11700), GPUs (RTX 4090, Jetson Thor, and AGX Orin), NPUs (Ascend 310B and 310P), and an XPU (Intel B60 Pro). All platforms support the PyTorch framework via the torch.cuda, t...

work page internal anchor Pith review arXiv