Dexterous world models. arXiv preprint arXiv:2512.17907, 2025.
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data
  Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
- DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion
  DexSIM is a bi-directional video diffusion model with hand trajectory embedding and spatial memory cache for real-time dexterous hand-object simulation at 15 FPS.
- DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors
  DeWorldSG improves 3D scene graph generation from RGB-D sequences by using depth-guided 3D Gaussian object nodes and V-JEPA 2 world-model priors for spatiotemporal relation refinement, reporting large recall gains on 3DSSG and ReplicaSSG.
- AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
  AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.