Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

2026 · cs.CV · arXiv 2603.23202

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns -- offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention through KL divergence, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance with fewer training steps and maintain robustness under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
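The abstract's core mechanism (gaze heatmaps pooled into patch-level distributions, then a KL penalty on attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the frame-averaging used for temporal aggregation, the sum-pooling over patches, the direction of the KL term (gaze as target, attention as approximation), and the `weight` value are all assumptions made for illustration.

```python
import numpy as np


def aggregate_gaze(frames):
    """Assumed temporal aggregation: average a (T, H, W) stack of gaze heatmaps."""
    return np.mean(np.asarray(frames, dtype=np.float64), axis=0)


def heatmap_to_patch_distribution(heatmap, patch_size, eps=1e-8):
    """Sum-pool an (H, W) heatmap into non-overlapping patches and normalize to sum to 1."""
    h, w = heatmap.shape
    gh, gw = h // patch_size, w // patch_size
    # Drop any rows/columns that do not fill a whole patch.
    cropped = heatmap[: gh * patch_size, : gw * patch_size]
    pooled = cropped.reshape(gh, patch_size, gw, patch_size).sum(axis=(1, 3))
    flat = pooled.reshape(-1) + eps  # eps avoids an all-zero distribution
    return flat / flat.sum()


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions over the same patches."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def gaze_regularized_loss(task_loss, attention, gaze_dist, weight=0.1):
    """Task loss plus a weighted KL term pulling attention toward the gaze distribution."""
    return task_loss + weight * kl_divergence(gaze_dist, attention)


# Usage: two 4x4 gaze frames, 2x2 patches, uniform model attention over 4 patches.
frames = np.zeros((2, 4, 4))
frames[:, 0, 0] = 1.0  # gaze concentrated in the top-left patch
gaze = heatmap_to_patch_distribution(aggregate_gaze(frames), patch_size=2)
uniform_attention = np.full(4, 0.25)
loss = gaze_regularized_loss(task_loss=1.0, attention=uniform_attention, gaze_dist=gaze)
```

Because the penalty only enters the training loss, a sketch like this is consistent with the abstract's claim of no inference-time overhead: nothing here would need to run at deployment.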

representative citing papers

Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

cs.RO · 2026-06-29 · unverdicted · novelty 3.0

Real-robot trials with OpenVLA on a UR5e arm show consistent offline-to-closed-loop gaps driven by action semantics, coordinate conventions, temporal alignment, image preprocessing, and dataset quality rather than model capacity.

citing papers explorer

Showing 1 of 1 citing paper.

Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

cs.RO · 2026-06-29 · unverdicted · none · ref 24 · internal anchor

Real-robot trials with OpenVLA on a UR5e arm show consistent offline-to-closed-loop gaps driven by action semantics, coordinate conventions, temporal alignment, image preprocessing, and dataset quality rather than model capacity.
