DynaFLIP pre-trains dynamics-aware image encoders by aligning image, language, and 3D flow modalities through simplex-volume minimization plus regularizers on video triplets, yielding reusable backbones that improve manipulation policies by up to 22.5% in out-of-distribution settings.
Towards uniformity and alignment for multimodal representation learning. arXiv preprint arXiv:2602.09507
2 Pith papers cite this work. Polarity classification is still indexing.