VLA language backbones are highly redundant on manipulation benchmarks: half of the LLM blocks can be removed, and as few as two blocks suffice to recover baseline performance after fine-tuning. The vision and action pathways do not show this redundancy.
arXiv preprint arXiv:2505.21200 (2025)
9 Pith papers cite this work.
citation-role summary
roles: background 3
citation-polarity summary
polarities: background 3
representative citing papers
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
VLA-InfoEntropy accelerates Vision-Language-Action model inference by using visual entropy, attention entropy, and timestep cues to prune redundant tokens while preserving task-critical content.
OxyGen unifies KV cache management in MoT VLAs to enable cross-task KV sharing and cross-frame continuous batching, delivering up to 3.7x speedup with 200+ tokens/s language and 70 Hz action throughput on-device.
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.
ElegantVLA accelerates VLA models up to 3.77x by dynamically scheduling compute across vision, language, and action components without retraining the base model.
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
AttenA+ reweights action training objectives in VLA and WAM models via inverse velocity attention to prioritize kinematically critical segments, yielding small benchmark gains.
citing papers explorer
- ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x speedup with comparable or better performance on embodied tasks.
- Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.