Robot-DIFT: Correspondence-Sensitive Diffusion Features for Contact-Rich Robot Manipulation

Georgia Chalvatzaki; Gerhard Neumann; Jiahong Xue; Xiaogang Jia; Yu Deng; Yufeng Jin

arxiv: 2602.11934 · v2 · pith:KQCSR3QAnew · submitted 2026-02-12 · 💻 cs.RO

Yu Deng, Yufeng Jin, Xiaogang Jia, Jiahong Xue, Gerhard Neumann, Georgia Chalvatzaki

classification 💻 cs.RO

keywords diffusion, correspondence, robot-dift, feature, features, manipulation, policy, semantic

Robot manipulation often fails in the final millimeters: a policy may recognize the right object yet miss the pose offsets, boundaries, or pre-contact alignments needed for action. We argue that such failures arise when semantic invariance suppresses correspondence cues for closed-loop control, or when these cues are not exposed to the policy in a usable form. Modern visual encoders provide strong semantic abstractions, but contact-rich manipulation requires correspondence sensitivity: discriminative feature responses to action-relevant changes in pose, boundary, and contact geometry. Diffusion features provide a strong prior for dense correspondence, but direct use is impractical due to stochasticity, latency, and representation drift. We introduce Robot-DIFT, a deterministic diffusion-derived backbone for real-time control. Through Manifold Distillation, Robot-DIFT converts a noise-conditioned diffusion Teacher into a clean-input, single-pass Student while preserving the Teacher's feature manifold. A Spatial-Semantic Feature Pyramid Network (S2-FPN) fuses coarse-to-fine Student decoder features into visual tokens that expose semantic context and fine contact detail to the policy. Across RoboCasa, LIBERO-10, and real robots, Robot-DIFT outperforms vision-language, self-supervised, geometry-oriented, and diffusion baselines on contact-sensitive tasks. Controlled backbone/readout swaps show that S2-FPN unlocks, rather than replaces, the diffusion correspondence prior.
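
The coarse-to-fine fusion mentioned in the abstract can be illustrated with a generic top-down feature-pyramid step. This is only a minimal sketch of the standard FPN idea, not the authors' S2-FPN: the abstract does not specify layer counts, channel widths, or the fusion operator. The one-channel grids, the 2x nearest-neighbor upsampling, the element-wise addition, and the helper names (upsample_nearest_2x, fuse_top_down, to_tokens) are all assumptions made for illustration.

```python
from typing import List

# Assumption: a feature map is a one-channel grid (rows x cols) of floats.
Grid = List[List[float]]


def upsample_nearest_2x(grid: Grid) -> Grid:
    """Double height and width by repeating each value (nearest neighbor)."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_top_down(levels: List[Grid]) -> List[Grid]:
    """Fuse maps ordered coarse -> fine; each level must be 2x the previous one.

    Hypothetical fusion rule: upsample the coarser fused map and add it
    element-wise to the finer map, as in a plain top-down FPN pathway.
    """
    fused = [levels[0]]
    for fine in levels[1:]:
        up = upsample_nearest_2x(fused[-1])
        if len(up) != len(fine) or len(up[0]) != len(fine[0]):
            raise ValueError("each level must be exactly 2x the coarser one")
        fused.append(
            [[a + b for a, b in zip(r_up, r_fine)] for r_up, r_fine in zip(up, fine)]
        )
    return fused


def to_tokens(grid: Grid) -> List[float]:
    """Flatten a fused map row by row into a token sequence."""
    return [v for row in grid for v in row]


# Toy example: a 1x1 coarse map (context) and a 2x2 fine map (detail).
coarse = [[1.0]]
fine = [[0.0, 0.5], [0.25, 0.75]]
fused = fuse_top_down([coarse, fine])
tokens = to_tokens(fused[-1])
print(tokens)  # [1.0, 1.5, 1.25, 1.75]
```

In this toy case every fine-resolution token carries the coarse value plus its own local detail, which is the general sense in which a pyramid readout exposes both context and fine structure to a downstream policy.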

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
cs.LG 2026-05 unverdicted novelty 6.0

Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment
cs.RO 2026-04 unverdicted novelty 5.0

LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-dist...