DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization
read the original abstract
Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods tend to struggle once test-time conditions differ from the demonstrations, such as changes in lighting, texture, viewpoint, object placement, or object identity. To address this challenge, we propose DIffusion POlicy with compLementarity Encoders (DIPOLE), a visuomotor policy that learns to fuse complementary modalities through a training-time mechanism rather than a specialized fusion architecture. A modality-wise dropout masks one branch at each training step, encouraging each modality to remain individually informative. A lightweight cross-attention layer then exchanges complementary cues between the two. This design endows DIPOLE with five core strengths: stable high performance across diverse tasks, robustness to visual changes, spatial generalization at sub-centimeter precision, emergent capability beyond either modality, and zero-shot transfer to unseen objects. Across 18 simulated and 4 real-world tasks, DIPOLE outperforms six baselines by 39.1% on average, with gains of 41.5% under unseen visual distractors and 15.2% under randomized object placement.
This paper has not been read by Pith yet.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.