UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies

Chuer Pan; Dongjae Lee; Guanya Shi; Harsh Gupta; Huy Ha; Muqing Cao; Sebastian Scherer; Shuran Song; Xiaofeng Guo

arxiv: 2510.02614 · v3 · pith:TJQI2YV7new · submitted 2025-10-02 · 💻 cs.RO

classification 💻 cs.RO

keywords embodiment-aware, policies, deployment, diffusion, manipulation, umi-on-air, aerial, approach

Abstract

We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments, such as aerial manipulators, is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse, and even highly constrained, embodiments. All code, data, checkpoints, and result videos can be found at umi-on-air.github.io.
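
The idea of feeding a tracking-cost gradient back into diffusion sampling can be illustrated with a toy sketch. This is not the authors' EADP implementation: the denoiser (`toy_denoise_step`), the first-order tracker (`tracking_matrix`), the quadratic cost, and all parameter values are hypothetical stand-ins chosen so the example runs with NumPy alone. It only shows the general pattern of alternating a denoising update with a gradient step on a controller's tracking cost.

```python
import numpy as np

def tracking_matrix(horizon, alpha):
    """Linear map A so that A @ ref is the rollout of a toy first-order tracker
    x[t] = x[t-1] + alpha * (ref[t] - x[t-1]), with x[0] = ref[0].
    (Hypothetical stand-in for an embodiment-specific controller.)"""
    A = np.zeros((horizon, horizon))
    A[0, 0] = 1.0
    for t in range(1, horizon):
        A[t] = (1.0 - alpha) * A[t - 1]
        A[t, t] += alpha
    return A

def tracking_cost_and_grad(traj, A):
    """Cost 0.5 * ||traj - A @ traj||^2 and its gradient with respect to traj."""
    M = np.eye(len(A)) - A
    err = M @ traj
    return 0.5 * np.sum(err ** 2), M.T @ err

def toy_denoise_step(traj, target, rate=0.2):
    """Assumed placeholder for a learned denoiser: pull the sample toward
    a fixed 'demonstrated' trajectory."""
    return traj + rate * (target - traj)

def sample(target, A, steps=50, guidance_scale=0.0, seed=0):
    """Start from noise, then alternate denoising with a guidance step
    that descends the tracking cost."""
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal(target.shape)
    for _ in range(steps):
        traj = toy_denoise_step(traj, target)
        _, grad = tracking_cost_and_grad(traj, A)
        traj = traj - guidance_scale * grad
    return traj

# A jerky square-wave reference that the slow toy tracker cannot follow exactly.
target = np.sign(np.sin(np.linspace(0.0, 4.0 * np.pi, 32)))[:, None]
A = tracking_matrix(horizon=32, alpha=0.5)

unguided = sample(target, A, guidance_scale=0.0)
guided = sample(target, A, guidance_scale=0.5)
print("unguided tracking cost:", tracking_cost_and_grad(unguided, A)[0])
print("guided tracking cost:  ", tracking_cost_and_grad(guided, A)[0])
```

In this toy setting the guided sample trades some fidelity to the reference for a trajectory the tracker can follow more closely. The guidance scale controls that trade-off, and a value of zero recovers the unguided sampler.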

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
cs.CV 2026-05 unverdicted novelty 7.0

The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-...

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds
cs.RO 2026-05 unverdicted novelty 6.0

HCLM presents a hierarchical architecture that uses an SE(3)-invariant diffusion policy for coordination and a hybrid whole-body controller with MPC and admittance control for safe closed-chain loco-manipulation on du...

BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
cs.RO 2026-04 unverdicted novelty 6.0

A tactile-aware hierarchical policy for quadrupedal loco-manipulation improves real-world contact-rich task performance by 28.54% over vision-only and visuotactile baselines.

XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
cs.RO 2026-04 unverdicted novelty 6.0

XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
cs.RO 2026-04 unverdicted novelty 5.0

A hierarchical tactile-aware policy combines human-demonstration training for contact cue prediction with sim-to-real reinforcement learning to improve quadrupedal loco-manipulation performance by 28.54% over vision b...