hub Canonical reference

Masquerade: Learning from in-the-wild human videos using data-editing

Marion Lepert, Jiaying Fang, Jeannette Bohg · 2025 · cs.RO · arXiv 2508.09976

Canonical reference. 71% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 15 citing papers arXiv PDF

abstract

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 1 other 1

citation-polarity summary

background 5 unclear 1 use method 1

representative citing papers

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

cs.RO · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

MonoDuo: Using One Robot Arm to Learn Bimanual Policies

cs.RO · 2026-05-28 · unverdicted · novelty 6.0

MonoDuo generates synthetic bimanual demonstrations from single-arm teleoperation plus human collaboration to train policies achieving up to 70% zero-shot success on five manipulation tasks, with 65-70% gains from 25-shot finetuning.

LACE: Latent Visual Representation for Cross-Embodiment Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

cs.RO · 2026-05-05 · unverdicted · novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from single human demonstrations without paired data.

Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

cs.RO · 2026-04-25 · unverdicted · novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

cs.RO · 2026-04-15 · unverdicted · novelty 6.0

Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

cs.RO · 2026-04-12 · unverdicted · novelty 6.0

WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.

ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration

cs.RO · 2026-04-09 · unverdicted · novelty 6.0

ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

cs.RO · 2025-11-06 · unverdicted · novelty 6.0

X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five real-world manipulation tasks.

CORE: Common Outcome Regularities from Action-Free Visual Demonstrations for Robot Manipulation

cs.RO · 2026-06-28 · unverdicted · novelty 5.0

CORE extracts visual goal prototypes from terminal embeddings in action-free demonstrations to condition robot policies, reporting success rate gains of up to 17 percentage points on manipulation benchmarks.

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

cs.RO · 2026-05-24 · unverdicted · novelty 5.0

HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

cs.CV · 2026-04-30 · unverdicted · novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.

LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

cs.RO · 2026-04-12 · unverdicted · novelty 5.0

LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-distribution robustness.

citing papers explorer

Showing 1 of 1 citing paper after filters.

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations cs.RO · 2025-11-06 · unverdicted · none · ref 9 · internal anchor
X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five real-world manipulation tasks.

Masquerade: Learning from in-the-wild human videos using data-editing

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer