
arxiv: 2509.05614 · v3 · pith:IW7BHYYVnew · submitted 2025-09-06 · 💻 cs.CV · cs.AI · cs.RO

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

classification 💻 cs.CV cs.AI cs.RO
keywords pruning, global, local, specprune-vla, acceleration, action, action-aware, context

Pruning is a common acceleration technique for compute-bound models that removes computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to a success-rate drop of more than 20% and limited speedup in some scenarios. In this paper, we observe spatial-temporal consistency in VLA tasks (input images in consecutive steps are highly similar) and propose the key insight that token selection should combine local information with the global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller. We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to a 1.57$\times$ speedup in LIBERO simulation and 1.70$\times$ on real-world tasks, with negligible success rate degradation.
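As a rough illustration of the idea in the abstract, the sketch below combines a "global" importance score (for example, attention carried over from the previous action step) with a "local" score from the current step, and lets a speed-based controller set the token budget. It is not the authors' implementation. The union rule, the function names, the speed threshold, the keep ratios, and the choice that slow (fine-grained) motion keeps more tokens are all assumptions made for this example. Layer-level dynamic pruning is not shown.

```python
import math
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def select_tokens(global_scores: Sequence[float],
                  local_scores: Sequence[float],
                  k_global: int, k_local: int) -> List[int]:
    """Keep the union of the top tokens under each score, in token order.

    Assumption: the two sources are combined by a simple set union.
    """
    keep = set(top_k_indices(global_scores, k_global))
    keep.update(top_k_indices(local_scores, k_local))
    return sorted(keep)


def keep_ratio(speed: float, threshold: float = 0.05,
               coarse_ratio: float = 0.25, fine_ratio: float = 0.5) -> float:
    """Hypothetical controller: slow end-effector motion is treated as
    fine-grained and prunes less; fast motion is coarse and prunes more.
    Threshold and ratios are illustrative values only.
    """
    return fine_ratio if speed < threshold else coarse_ratio


def prune_step(global_scores: Sequence[float],
               local_scores: Sequence[float],
               speed: float) -> List[int]:
    """Pick visual tokens for one action step; k is a per-source budget."""
    n = len(global_scores)
    k = max(1, math.ceil(keep_ratio(speed) * n))
    return select_tokens(global_scores, local_scores, k, k)


if __name__ == "__main__":
    # Made-up scores for 8 visual tokens.
    g = [0.9, 0.1, 0.4, 0.05, 0.3, 0.2, 0.6, 0.0]
    l = [0.1, 0.8, 0.2, 0.05, 0.7, 0.1, 0.3, 0.0]
    print(prune_step(g, l, speed=0.2))   # fast motion, coarse action
    print(prune_step(g, l, speed=0.01))  # slow motion, fine action
```

With these made-up scores, the coarse step keeps four of the eight tokens and the fine step keeps five, so the controller trades speed for detail only when the motion is slow.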

This paper has not been read by Pith yet.


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  2. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  3. KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

    cs.RO 2026-03 unverdicted novelty 7.0

    KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.

  4. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  5. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 unverdicted novelty 6.0

    FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

  6. FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

    cs.CV 2025-09 conditional novelty 6.0

    FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.

  7. AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

    cs.LG 2025-11 unverdicted novelty 5.0

    AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.