Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

2026 · cs.CV · arXiv 2603.11755

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Controllable video generation for complex hand-object interactions is a critical step toward building visual world models. However, existing methods often struggle to achieve fine-grained, 3D-consistent hand articulation in generated videos. By relying on dense 2D trajectories or implicit pose representations, they collapse crucial geometric structures into spatially ambiguous signals, leading to severe motion inconsistencies and hallucinated artifacts under egocentric occlusions. To address this, we propose leveraging sparse 3D hand joints as explicit control signals with three key advantages: explicit geometry to resolve occlusions, an intuitive interface for interactive editing, and cross-embodiment generalization to robotic hands. Built upon this, our efficient control module extracts occlusion-aware features from the source reference frame by penalizing unreliable visual features from hidden joints, and employs a 3D-based weighting mechanism to handle dynamically occluded target joints during motion propagation. Meanwhile, it directly injects 3D geometric embeddings into the latent space to enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline, yielding 1M high-quality egocentric video clips paired with precise hand trajectories. Experiments demonstrate that our approach outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic hand-object interactions.

representative citing papers

HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

cs.CV · 2026-06-05 · unverdicted · novelty 5.0

AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.

citing papers explorer


HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control cs.CV · 2026-07-02 · unverdicted · none · ref 66 · internal anchor
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization cs.CV · 2026-06-05 · unverdicted · none · ref 60 · internal anchor
