Rethinking the practicality of vision-language-action model: A comprehensive benchmark and an improved baseline

Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, Haoang Li · 2026 · arXiv 2602.22663

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

cs.RO · 2026-04-13 · unverdicted · novelty 6.0 · ref 18

AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
