Spatial-aware vla pretraining through visual-physical alignment from human videos

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu · 2025 · arXiv 2512.13080

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

cs.RO · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.

citing papers explorer

Showing 3 of 3 citing papers.

Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 101
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 29
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation cs.RO · 2026-04-27 · unverdicted · none · ref 12 · 2 links
MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.

Spatial-aware vla pretraining through visual-physical alignment from human videos

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer