arxiv: 2511.16449 · v5 · pith:6CYWULPKnew · submitted 2025-11-20 · 💻 cs.CV · cs.AI

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

classification 💻 cs.CV cs.AI
keywords visual pruning inference manipulation models semantic token tokens
Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.
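
The pipeline outlined in the abstract (two importance signals, temporal smoothing of the action signal, then Combine-then-Filter under a token budget) can be illustrated with a minimal sketch. The abstract does not specify the smoothing rule, how the two signals are combined, or how redundancy is measured, so everything concrete here is an assumption made for illustration: an exponential moving average for smoothing, a union of per-signal top-k sets for the combine step, and a greedy cosine-similarity filter for the redundancy step. The names `ema_update`, `top_k`, and `combine_then_filter` are hypothetical and are not taken from the paper.

```python
import math


def ema_update(prev, current, alpha=0.5):
    """Smooth per-token action-attention scores over time (assumed: EMA)."""
    if prev is None:
        return list(current)
    return [alpha * c + (1 - alpha) * p for p, c in zip(prev, current)]


def top_k(scores, k):
    """Indices of the k highest-scoring tokens (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def cosine(a, b):
    """Cosine similarity between two feature vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def combine_then_filter(semantic, action, features, budget):
    """Assumed reading of Combine-then-Filter, not the paper's exact rule.

    Combine: pool tokens that are salient under either signal.
    Filter: greedily keep the pooled tokens least similar to those
    already kept, until the budget is reached.
    """
    if budget <= 0:
        return []
    candidates = top_k(semantic, budget) | top_k(action, budget)
    if not candidates:
        return []
    # Seed with the token that scores highest under either signal.
    ranked = sorted(
        sorted(candidates),
        key=lambda i: max(semantic[i], action[i]),
        reverse=True,
    )
    kept, remaining = [ranked[0]], ranked[1:]
    while len(kept) < budget and remaining:
        # Pick the candidate whose closest kept neighbour is farthest away.
        best = min(
            remaining,
            key=lambda i: max(cosine(features[i], features[j]) for j in kept),
        )
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


# Toy example: tokens 0 and 1 are near-duplicates with high semantic score;
# token 2 matters only for the action signal.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
semantic = [0.9, 0.8, 0.1, 0.0]
action = ema_update([0.1, 0.1, 0.9, 0.0], [0.1, 0.1, 0.9, 0.0])
kept = combine_then_filter(semantic, action, features, budget=2)
```

In this toy case a purely semantic top-2 selection would keep the two near-duplicate tokens 0 and 1, whereas pooling both signals and filtering for redundancy keeps one of them plus the action-relevant token 2. That is the kind of behaviour the abstract motivates, though the paper's actual scoring and filtering rules may differ.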

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

    cs.RO 2026-05 unverdicted novelty 7.0

    Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time by 13.6% on LIBERO while preserving 95% success rate.

  2. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  3. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  4. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 unverdicted novelty 6.0

    FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.