KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng , Zhihao Mao , Maoliang Li , Jiayu Chen , Xinhao Sun , Zhaobo Zhang , Donggang Cao , Hong Mei

show 1 more author

Xiang Chen

Authors on Pith no claims yet

classification 💻 cs.RO cs.LG

keywords addresserrorsissueskervmodelsthresholdacceptanceadjustment

0 comments

read the original abstract

Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%~37% acceleration with nearly no Success Rate loss.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
cs.RO 2026-05 unverdicted novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
cs.RO 2026-03 unverdicted novelty 7.0

HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
cs.RO 2026-03 conditional novelty 7.0

VLN-Cache delivers up to 1.52x faster inference in VLN models by using view-aligned remapping for geometric consistency and a task-relevance saliency filter to manage semantic changes during navigation.
FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO 2026-04 unverdicted novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.
RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models
cs.DC 2026-03 unverdicted novelty 5.0

RoboECC delivers up to 3.28x speedup for VLA model inference via co-aware segmentation and network-aware adjustment with 2.55-2.62% overhead.