ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.
Each object is documented through high-quality 3D meshes, HD rotation videos, and multiple tactile recordings from a GelSight sensor (Yuan et al., 2017a)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.