ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.