On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. arXiv preprint arXiv:2212.05749, 2022
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
citing papers explorer
- SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
  SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.
- IGen: Scalable Data Generation for Robot Learning from Open-World Images
  IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
  Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
- Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
  An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.