EmbodimentSemantic is a spatial scene-graph dataset and benchmark for evaluating relational grounding in vision-language models on embodied manipulation trajectories.
org/abs/2506.09930
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
AxisGuide augments RGB images with rendered robot base-frame axis cues to improve generalization of visuomotor manipulation policies under distribution shifts.
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
citing papers explorer
-
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
- LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization