VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.
Spatialreasoner: To- wards explicit and generalizable 3d spatial reasoning
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 2polarities
background 2representative citing papers
SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
PASR performs pose-aware analysis-by-synthesis by aligning 3D projections with DINOv3 patch features, outperforming prior methods on clean and occluded retrieval while also handling pose estimation and classification.
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
citing papers explorer
-
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.
-
SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.
-
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
-
PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
PASR performs pose-aware analysis-by-synthesis by aligning 3D projections with DINOv3 patch features, outperforming prior methods on clean and occluded retrieval while also handling pose estimation and classification.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models