EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng · 2025 · cs.RO · arXiv 2507.12440

Canonical reference. 100% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background: 100% of classified citations

abstract

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
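The conversion step the abstract mentions, turning a predicted wrist pose into robot joint commands with Inverse Kinematics, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy planar 2-link arm, a damped least-squares solver, and hypothetical names (`LINK_LENGTHS`, `forward_kinematics`, `jacobian`, `solve_ik`) and values chosen only for illustration. Hand retargeting is not covered here.

```python
import numpy as np

# Hypothetical link lengths (metres) for a toy planar 2-link arm; not from the paper.
LINK_LENGTHS = (0.30, 0.25)


def forward_kinematics(q, link_lengths=LINK_LENGTHS):
    """Return the (x, y) end-effector position for joint angles q = (q1, q2)."""
    l1, l2 = link_lengths
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])


def jacobian(q, link_lengths=LINK_LENGTHS):
    """Return the 2x2 Jacobian of the end-effector position w.r.t. the joints."""
    l1, l2 = link_lengths
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])


def solve_ik(target, q_init, damping=1e-2, step=0.5, iters=200, tol=1e-8):
    """Damped least-squares IK: iterate joint angles until the tip reaches target."""
    q = np.array(q_init, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(iters):
        err = target - forward_kinematics(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^{-1} err; damping keeps the solve
        # well-conditioned near singular arm configurations.
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        q += step * dq
    return q


# Stand-in for a wrist position predicted by the model (illustrative values).
wrist_target = np.array([0.35, 0.20])
q_solution = solve_ik(wrist_target, q_init=(0.0, 1.2))
residual = np.linalg.norm(forward_kinematics(q_solution) - wrist_target)
print("joint angles:", q_solution, "residual:", residual)
```

A real humanoid arm has more joints and a 6-DoF wrist target, so a practical pipeline would typically rely on a robotics library's IK solver rather than a hand-written one like this.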

citation-role summary

background 11 · dataset 1

citation-polarity summary

background 12

representative citing papers

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

EgoEngine transforms egocentric human videos into high-fidelity robot data enabling zero-shot visuomotor dexterous policy learning without real-robot demonstrations.

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

cs.RO · 2026-05-18 · unverdicted · novelty 7.0

Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.

Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.

OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation

cs.RO · 2026-06-20 · unverdicted · novelty 6.0

OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.

Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

cs.RO · 2026-06-20 · unverdicted · novelty 6.0

Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

DO AS I DO reconstructs and retargets hand-object interactions from in-the-wild monocular RGB videos to produce dexterous robot manipulation trajectories, outperforming prior methods on ground-truth and online video datasets.

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

cs.RO · 2026-06-15 · unverdicted · novelty 6.0 · 2 refs

ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.

T-Rex: Tactile-Reactive Dexterous Manipulation

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

cs.RO · 2026-06-11 · unverdicted · novelty 6.0

EmbodiSteer steers embodiment-agnostic Cartesian diffusion policies into joint space with Jacobian-based collision guidance after each denoising step for zero-shot cross-embodiment deployment.

μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

cs.RO · 2026-06-07 · unverdicted · novelty 6.0

EgoPriMo learns a unified egocentric motion prior with a Triple-stream DiT model that supports reconstruction, generation, and forecasting of SMPL motions from egocentric views and text, outperforming prior methods and transferable to humanoid controllers.

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

cs.RO · 2026-06-06 · unverdicted · novelty 6.0

SIMPLE is a new large-scale simulation benchmark for humanoid loco-manipulation that integrates accurate dynamics and photorealistic rendering and demonstrates policy transfer from simulation to physical robots.

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

Cotraining on 532 everyday human videos with accurate hand labels improves robot policies by 29.7% when networks specialize to human versus robot embodiments.

ActiveMimic: Egocentric Video Pretraining with Active Perception

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.

citing papers explorer

Showing 49 of 49 citing papers.

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes cs.CV · 2026-07-01 · unverdicted · none · ref 13 · internal anchor
EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 46 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations cs.RO · 2026-06-10 · unverdicted · none · ref 46 · internal anchor
EgoEngine transforms egocentric human videos into high-fidelity robot data enabling zero-shot visuomotor dexterous policy learning without real-robot demonstrations.
Dexora: Open-source VLA for High-DoF Bimanual Dexterity cs.RO · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV · 2026-05-18 · unverdicted · none · ref 46 · internal anchor
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 106 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 48 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 111 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation cs.RO · 2026-07-01 · unverdicted · none · ref 62 · internal anchor
Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.
Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments cs.RO · 2026-06-30 · unverdicted · none · ref 22 · internal anchor
Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots cs.RO · 2026-06-26 · unverdicted · none · ref 62 · internal anchor
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation cs.RO · 2026-06-20 · unverdicted · none · ref 48 · internal anchor
OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data cs.RO · 2026-06-20 · unverdicted · none · ref 5 · internal anchor
Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos cs.RO · 2026-06-17 · unverdicted · none · ref 32 · internal anchor
DO AS I DO reconstructs and retargets hand-object interactions from in-the-wild monocular RGB videos to produce dexterous robot manipulation trajectories, outperforming prior methods on ground-truth and online video datasets.
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction cs.CV · 2026-06-17 · unverdicted · none · ref 74 · internal anchor
Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining cs.RO · 2026-06-15 · unverdicted · none · ref 11 · 2 links · internal anchor
ACE-Ego-0 is a VLA pretraining framework that turns egocentric human videos into robot-format pseudo-actions via a video-to-action pipeline and trains jointly with robot data under a reliability-aware objective.
T-Rex: Tactile-Reactive Dexterous Manipulation cs.RO · 2026-06-15 · unverdicted · none · ref 2 · internal anchor
T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.
EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment cs.RO · 2026-06-11 · unverdicted · none · ref 23 · internal anchor
EmbodiSteer steers embodiment-agnostic Cartesian diffusion policies into joint space with Jacobian-based collision guidance after each denoising step for zero-shot cross-embodiment deployment.
μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models cs.LG · 2026-06-10 · unverdicted · none · ref 71 · internal anchor
Adding recurrent memory tokens to VLA models raises success rates on partially observable manipulation tasks from 0.42 to 0.84 on training and 0.07 to 0.23 on held-out tasks while preserving performance under full observability.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 32 · internal anchor
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control cs.RO · 2026-06-07 · unverdicted · none · ref 9 · internal anchor
EgoPriMo learns a unified egocentric motion prior with a Triple-stream DiT model that supports reconstruction, generation, and forecasting of SMPL motions from egocentric views and text, outperforming prior methods and transferable to humanoid controllers.
SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation cs.RO · 2026-06-06 · unverdicted · none · ref 22 · internal anchor
SIMPLE is a new large-scale simulation benchmark for humanoid loco-manipulation that integrates accurate dynamics and photorealistic rendering and demonstrates policy transfer from simulation to physical robots.
What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos? cs.RO · 2026-06-04 · unverdicted · none · ref 21 · internal anchor
Cotraining on 532 everyday human videos with accurate hand labels improves robot policies by 29.7% when networks specialize to human versus robot embodiments.
ActiveMimic: Egocentric Video Pretraining with Active Perception cs.RO · 2026-06-04 · unverdicted · none · ref 17 · internal anchor
ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.
EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices cs.CV · 2026-05-16 · unverdicted · none · ref 10 · internal anchor
EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.
SCAR: Self-Supervised Continuous Action Representation Learning cs.RO · 2026-05-13 · unverdicted · none · ref 38 · internal anchor
SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation cs.CV · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
GazeVLA: Learning Human Intention for Robotic Manipulation cs.RO · 2026-04-24 · unverdicted · none · ref 72 · internal anchor
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling cs.RO · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration cs.RO · 2026-04-09 · unverdicted · none · ref 20 · internal anchor
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation cs.RO · 2026-04-09 · unverdicted · none · ref 54 · 2 links · internal anchor
HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World cs.RO · 2026-04-08 · unverdicted · none · ref 55 · internal anchor
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.
Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations cs.RO · 2026-04-08 · unverdicted · none · ref 28 · internal anchor
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations cs.RO · 2026-03-03 · unverdicted · none · ref 42 · internal anchor
HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.
Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly? cs.RO · 2026-06-24 · unverdicted · none · ref 61 · internal anchor
Play2Perfect uses task-agnostic RL play pretraining on diverse objects to build reusable manipulation priors, then fine-tunes for assembly, yielding 33x sample efficiency gains and 60% success on 0.5mm-clearance insertions in sim-to-real transfer.
PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought cs.CV · 2026-06-23 · unverdicted · none · ref 108 · internal anchor
PointVG-R is a new MLLM that reaches SOTA on pointing localization by 15.86 mIoU points via a geometric reasoning pipeline, EgoPoint-CoT dataset, SFT, RL, and variance-based reward weighting.
LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation cs.RO · 2026-06-22 · unverdicted · none · ref 60 · internal anchor
LaST-HD creates a shared latent dynamics space via a world model to transfer physical reasoning from scalable human-hand demonstrations to robots, achieving over 90% accuracy with 20 minutes of new data after mixed training.
LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition cs.RO · 2026-06-10 · unverdicted · none · ref 37 · internal anchor
LUCID learns embodiment-agnostic intent models from unstructured human videos to train dexterous robot policies in simulation, enabling zero-shot transfer on real-world tasks like stirring and wiping.
Grounded 3D-Aware Spatial Vision-Language Modeling cs.CV · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos cs.RO · 2026-05-24 · unverdicted · none · ref 12 · internal anchor
HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.
Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum cs.RO · 2026-05-20 · unverdicted · none · ref 47 · internal anchor
A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation cs.RO · 2026-04-27 · unverdicted · none · ref 42 · 2 links · internal anchor
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.
LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment cs.RO · 2026-04-12 · unverdicted · none · ref 21 · internal anchor
LIDEA bridges the human-robot embodiment gap via implicit feature distillation in 2D and explicit geometry alignment in 3D, enabling human data to substitute up to 80% of robot demonstrations with improved out-of-distribution robustness.
Toward Low-Latency Vision-Language Models with Doubly-Correct Predictions in Egocentric Visual Understanding cs.RO · 2026-06-23 · unverdicted · none · ref 7 · internal anchor
A rationale-informed pruning strategy for VLMs yields higher accuracy and more doubly-correct predictions than prior pruning methods on egocentric video benchmarks.
Robot Self-Improvement via Human-Video Dynamics Models cs.RO · 2026-06-19 · unverdicted · none · ref 11 · internal anchor
Human-video dynamics models enable cross-embodiment robot self-improvement via training-free Dynamics-Guided Action Correction, raising success rates from 40% to 81% on seven real-world tasks.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 204 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks cs.RO · 2026-04-26 · unverdicted · none · ref 45 · internal anchor
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.
From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data cs.RO · 2026-05-18 · unverdicted · none · ref 73 · internal anchor
The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.
