
arxiv: 2605.20894 · v1 · pith:BWTWGWPFnew · submitted 2026-05-20 · 💻 cs.RO

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Pith reviewed 2026-05-21 04:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords mobile manipulation · imitation learning · diffusion policy · kinematic decoupling · visual-inertial alignment · receding-horizon execution · household robotics

The pith

Decoupling base locomotion from hand manipulation with a one-shot camera anchor and online state realignment lets standard diffusion policies succeed on mobile tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that mobile imitation learning suffers from action labels polluted by human walking and from delays when generating long action sequences on a moving base. It solves this with a dual-camera setup where a chest camera gives global context and a wrist camera gives local detail. A quick ChArUco calibration unifies the two views so hand actions can be expressed relative to the chest, cleanly separating base movement from manipulation. An asynchronous executor then matches each new action prediction to the robot's latest position, skipping any outdated waypoints. If these steps work as claimed, portable human demonstrations become usable for training mobile robots on complex tasks without redesigning the policy model.

Core claim

By recording demonstrations with chest and wrist cameras and using a one-shot ChArUco-based spatial anchor to re-express hand poses relative to the chest, the method extracts independent SE(3) manipulation trajectories and SE(2) base trajectories. An asynchronous receding-horizon executor then performs online state matching so that each generated action chunk is realigned with the current physical pose before execution, discarding expired waypoints. Controlled comparisons show that the chest-relative labels close much of the gap to baselines while the state matching closes the rest, yielding an 83.8 percent average success rate across four long-horizon household tasks without any changes to the underlying policy architecture.

What carries the argument

One-shot ChArUco-based spatial anchor that unifies chest and wrist frames to extract decoupled SE(3) hand trajectories relative to the chest and SE(2) base trajectories, together with the asynchronous receding-horizon executor that realigns action chunks via online state matching.
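
A minimal sketch of the chest-relative re-expression, assuming both poses are already expressed in one shared world frame (which is what the spatial anchor is meant to provide) and that the world z axis points up. The function names (`matmul`, `inv_rigid`, `yaw_pose`, `decouple`) and the choice to read the SE(2) base pose off the chest's planar position and yaw are illustrative assumptions, not the paper's implementation.

```python
import math

def matmul(A, B):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def inv_rigid(T):
    """Invert a rigid transform: [R t; 0 1]^-1 = [R^T  -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    p = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [p[0]], Rt[1] + [p[1]], Rt[2] + [p[2]],
            [0.0, 0.0, 0.0, 1.0]]

def yaw_pose(x, y, z, yaw):
    """Build a transform that rotates about the world z axis (z-up assumed)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def decouple(T_world_chest, T_world_hand):
    """Split one frame into a chest-relative SE(3) hand pose and an SE(2) base pose."""
    # Hand pose expressed in the chest frame: T_chest_hand = T_world_chest^-1 * T_world_hand.
    T_chest_hand = matmul(inv_rigid(T_world_chest), T_world_hand)
    # Planar base pose taken from the chest: (x, y, yaw). This mapping is an assumption.
    base = (T_world_chest[0][3], T_world_chest[1][3],
            math.atan2(T_world_chest[1][0], T_world_chest[0][0]))
    return T_chest_hand, base
```

Applying the same rigid motion to both the chest and the hand pose (a stand-in for walking) leaves `T_chest_hand` unchanged, which is the sense in which the manipulation label is locomotion-free; only the SE(2) base pose changes.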

If this is right

  • Chest-relative labels by themselves remove most locomotion contamination and close a large share of the performance gap to wrist-only baselines.
  • Online state matching through receding-horizon execution removes the need for corrective backward motions at action splices caused by base advance during inference.
  • The approach achieves high success on long-horizon tasks while leaving the underlying diffusion policy architecture unchanged.
  • Demonstrations can be collected with portable interfaces that require no robot during data gathering.
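
The state-matching idea in the second bullet can be sketched as a nearest-pose lookup over the predicted base waypoints. This is a simplified illustration, not the paper's executor: the SE(2) waypoint format, the `yaw_weight` in the distance, and the rule of resuming from the closest waypoint are all assumptions made here.

```python
import math

def pose_distance(a, b, yaw_weight=0.5):
    """Weighted SE(2) distance between (x, y, yaw) poses; the weighting is illustrative."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Wrap the yaw difference into [-pi, pi] before taking its magnitude.
    dyaw = math.atan2(math.sin(a[2] - b[2]), math.cos(a[2] - b[2]))
    return math.hypot(dx, dy) + yaw_weight * abs(dyaw)

def realign_chunk(chunk, current_pose):
    """Discard waypoints the base has already passed during inference.

    chunk: list of (x, y, yaw) waypoints predicted from a now-stale observation.
    current_pose: latest measured (x, y, yaw) of the base.
    Returns the chunk starting at the waypoint closest to the current pose.
    """
    if not chunk:
        return []
    k = min(range(len(chunk)),
            key=lambda i: pose_distance(chunk[i], current_pose))
    return chunk[k:]
```

With waypoints spaced 0.1 m apart along x and a base that has already advanced to x = 0.32 m, the first three waypoints are dropped and execution resumes from the one near 0.3 m, so the controller is never asked to drive back to a waypoint it has passed by more than half a step.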

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorization could reduce contamination in other human demonstration settings that mix walking and reaching.
  • The state-matching step may generalize to any generative policy whose inference time exceeds the base motion timescale.
  • Replacing the ChArUco board with learned visual anchors could test whether the method still works in changing lighting or without markers.

Load-bearing premise

The one-shot ChArUco-based spatial anchor reliably unifies chest and wrist visual-inertial frames with low error under natural human motion, allowing clean extraction of independent SE(3) and SE(2) trajectories.

What would settle it

A measurement that finds high unification error in the ChArUco anchor during typical walking would show the extracted trajectories remain contaminated and the reported gains cannot be credited to clean factorization.
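
One way to run such a measurement, sketched under the assumption that a reference transform is available for comparison (for example from motion capture, which the paper summary does not mention): score the estimated transform by its translation distance and geodesic rotation angle from the reference. `pose_error` is a hypothetical helper, and transforms are 4x4 nested lists.

```python
import math

def pose_error(T_est, T_ref):
    """Return (translation error, rotation error in radians) between two 4x4 rigid transforms."""
    trans = math.sqrt(sum((T_est[i][3] - T_ref[i][3]) ** 2 for i in range(3)))
    # trace(R_est^T R_ref) equals the sum of elementwise products of the two rotations.
    trace = sum(T_est[i][j] * T_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return trans, math.acos(cos_angle)
```

Logging these two numbers over a walking sequence would show whether the anchor error stays small or grows with motion.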

Figures

Figures reproduced from arXiv: 2605.20894 by Haonan Dong, Haoran Huang, Huixu Dong.

Figure 1: Overview of the Mobile UMI framework. Left: a human operator wearing a chest camera and carrying a handheld camera-gripper module collects …
Figure 2: Dual-view demonstration end and embodiment mapping to the …
Figure 3: Spatial anchoring, VIO integration, and action decoupling. Left: a …
Figure 4: Cross-view conditional diffusion policy architecture. Chest and hand …
Figure 5: Spatial-temporal delay compensation (asynchronous state matching).
Figure 6: Tea Bag Placement: qualitative rollout and representative failure modes. Top row shows a successful sequence (init, search-and-grasp, turn-around, …
Figure 7: Book placement: rollout and baseline failure modes under different module ablations.
Figure 8: Turn off the light: rollout and representative failure modes (miss touch, collision, wrong pose).
Abstract

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mobile UMI, a hardware-free demonstration framework for mobile imitation learning. It uses a dual-camera (chest and wrist) capture system, a one-shot ChArUco-based spatial anchor to unify visual-inertial frames and extract decoupled SE(3) manipulation and SE(2) base trajectories, and an asynchronous receding-horizon executor for online state matching to mitigate inference latency. On four long-horizon household tasks, it reports 83.8% average success over 100 trials per task, with ablations showing gains from chest-relative labels and online matching over ACT and Diffusion Policy baselines.

Significance. If the decoupling holds, the work demonstrates that explicit kinematic factorization at the data level, paired with state-level latency alignment, can address key bottlenecks in mobile imitation learning without modifying the underlying policy architecture. This could enable more scalable collection of clean labels using portable interfaces and simplify deployment on continuously moving bases.

major comments (2)
  1. [Method section on ChArUco spatial anchor] Method section describing the spatial anchor: The central claim that the one-shot ChArUco anchor 'allows clean extraction' of independent SE(3) and SE(2) trajectories rests on the unverified assumption of low unification error under natural human motion. No quantitative metrics (e.g., pose estimation error, robustness to partial occlusion, IMU drift, or non-rigid motion) or failure cases are reported for this component, despite its load-bearing role in ensuring the extracted labels are locomotion-free and in supporting the ablation results.
  2. [Experimental evaluation] Experimental results: The reported 83.8% success rate and controlled comparisons are promising, but the absence of error bars, variance measures, and detailed failure analysis across the 100 trials per task limits assessment of whether the gains from decoupled labels and online matching are robust or sensitive to the anchor's accuracy.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrasing 'allows clean extraction' should be tempered or footnoted to reflect the lack of supporting error analysis for the anchor.
  2. [Notation and figures] Notation and figures: Ensure consistent use of SE(3)/SE(2) terminology and clarify how the re-expressed hand pose is computed in any accompanying diagrams.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the work.

  1. Referee: [Method section on ChArUco spatial anchor] Method section describing the spatial anchor: The central claim that the one-shot ChArUco anchor 'allows clean extraction' of independent SE(3) and SE(2) trajectories rests on the unverified assumption of low unification error under natural human motion. No quantitative metrics (e.g., pose estimation error, robustness to partial occlusion, IMU drift, or non-rigid motion) or failure cases are reported for this component, despite its load-bearing role in ensuring the extracted labels are locomotion-free and in supporting the ablation results.

    Authors: We agree that quantitative validation of the ChArUco spatial anchor would provide stronger direct support for the claim of clean decoupled trajectory extraction. The current manuscript presents the anchor as a practical, hardware-free unification step using established computer-vision techniques, with its utility shown indirectly through high task success rates and the ablation comparing chest-relative versus wrist-only labels. To address the referee's concern directly, the revised manuscript will include a new quantitative evaluation subsection (or appendix) reporting pose estimation error, robustness to partial occlusion and natural human motion, IMU drift effects, and observed failure cases, obtained via additional controlled validation experiments. These metrics will clarify the anchor's contribution to locomotion-free labels. revision: yes

  2. Referee: [Experimental evaluation] Experimental results: The reported 83.8% success rate and controlled comparisons are promising, but the absence of error bars, variance measures, and detailed failure analysis across the 100 trials per task limits assessment of whether the gains from decoupled labels and online matching are robust or sensitive to the anchor's accuracy.

    Authors: We acknowledge that the experimental section would benefit from explicit variance reporting and failure-mode analysis to better demonstrate robustness. In the revised manuscript we will add standard deviations (or error bars) to all success-rate tables and figures, and we will expand the results discussion with a categorized breakdown of failure modes across the 100 trials per task (e.g., navigation errors, manipulation errors, timing/latency issues). This will allow readers to assess consistency of the reported gains from decoupled kinematics and online state matching, as well as any sensitivity to anchor accuracy. revision: yes
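
As an illustration of the variance reporting discussed here, a Wilson score interval gives a binomial confidence interval for a per-task success rate. This is one common choice, not something the paper or rebuttal commits to, and the counts in the usage note are hypothetical because per-task figures are not given in this summary.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

For a hypothetical task with 84 successes in 100 trials, `wilson_interval(84, 100)` gives roughly 0.76 to 0.90, which indicates how much a single 100-trial success rate can move from sampling noise alone.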

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained data processing and empirical evaluation


The paper describes a dual-camera capture system, one-shot ChArUco spatial anchor for re-expressing hand pose relative to chest to extract decoupled SE(3) and SE(2) trajectories, and an asynchronous receding-horizon executor for state matching. These are presented as engineering components evaluated empirically on four household tasks with success rates and comparisons to ACT and Diffusion Policy baselines. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The kinematic factorization is a preprocessing step on collected data rather than a derived result that reduces to its own inputs by construction. The central claim of effective solution without policy architecture changes rests on the reported controlled comparisons, which are independent of the method definition itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard computer-vision and control primitives rather than new physical postulates; the primary unverified premise is that the one-shot frame unification stays accurate under human motion.

axioms (1)
  • domain assumption: The ChArUco marker provides accurate one-shot spatial alignment between chest and wrist cameras in dynamic human motion scenarios.
    Invoked when describing unification of visual-inertial frames to extract decoupled trajectories.

pith-pipeline@v0.9.0 · 5830 in / 1430 out tokens · 61387 ms · 2026-05-21T04:50:29.110946+00:00 · methodology

