These are images of an object. What is the name of the object?
3 Pith papers cite this work.
citing papers explorer
- ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
  ReVSI rebuilds 3D spatial reasoning benchmarks for VLMs by re-annotating objects and geometry across 381 scenes and creating verified QA pairs that match actual model inputs like 16-64 frames.
- MiMo-Embodied: X-Embodied Foundation Model Technical Report
  MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
  Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.