Develop- ing vision-language-action model from egocentric videos

Tomoya Yoshida, Shuhei Kurita, et al · 2025 · arXiv 2509.21986

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

cs.RO · 2026-05-18 · unverdicted · novelty 3.0

The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.

citing papers explorer

Showing 2 of 2 citing papers after filters.

AFUN: Towards an Affordance Foundation Model for Functionality Understanding cs.RO · 2026-06-01 · unverdicted · none · ref 84
AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.
From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data cs.RO · 2026-05-18 · unverdicted · none · ref 75
The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.

Develop- ing vision-language-action model from egocentric videos

fields

years

verdicts

representative citing papers

citing papers explorer