On the generalization capacities of MLLMs for spatial intelligence. arXiv preprint arXiv:2603.06704.
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
citing papers explorer
- VLM3: Vision Language Models Are Native 3D Learners
  Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.
- Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling
  Survey organizing panoramic scene analysis literature by architectural design and training paradigm, identifying the absence of methods achieving both strict spherical equivariance and full reuse of perspective-pretrained weights, plus five evaluation protocol gaps and a six-point roadmap.