PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but generalize poorly, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities compared with modern vision-language models (VLMs). We propose a generalizable 3DVG framework, PanoGrounder, that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that (1) places a compact set of panoramic viewpoints according to the scene layout and geometry, (2) grounds the text query on each panoramic rendering with a VLM, and (3) fuses the per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates strong generalization to unseen 3D datasets and text rephrasings.
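
The lifting and fusion steps mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the projection, the depth source, or the fusion rule. The sketch assumes an equirectangular panorama, a per-pixel depth map rendered from the same viewpoint, and a score-weighted average as the fusion rule. The names `pixel_to_ray`, `lift_box`, and `fuse_boxes` are hypothetical.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Map equirectangular pixel coordinates to a unit direction vector.

    Assumption: u spans longitude [-pi, pi) left to right and v spans
    latitude [pi/2, -pi/2] top to bottom, with z pointing up.
    """
    lon = u / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - v / height * math.pi
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def lift_box(box2d, depth, cam_pos, width, height):
    """Lift a 2D box (u0, v0, u1, v1) to an axis-aligned 3D box.

    Each pixel inside the box is back-projected along its ray by its depth.
    Boxes that straddle the left/right seam of the panorama are not handled.
    Returns (min_corner, max_corner), or None if no pixel has valid depth.
    """
    u0, v0, u1, v1 = box2d
    points = []
    for v in range(v0, v1):
        for u in range(u0, u1):
            d = depth[v][u]
            if d <= 0.0:  # treat non-positive depth as missing
                continue
            ray = pixel_to_ray(u + 0.5, v + 0.5, width, height)  # pixel center
            points.append(tuple(c + d * r for c, r in zip(cam_pos, ray)))
    if not points:
        return None
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def fuse_boxes(boxes, scores):
    """Fuse per-view 3D boxes by a score-weighted average of their corners.

    This fusion rule is an illustrative assumption, not the paper's method.
    """
    total = sum(scores)
    lo = tuple(sum(s * b[0][i] for b, s in zip(boxes, scores)) / total for i in range(3))
    hi = tuple(sum(s * b[1][i] for b, s in zip(boxes, scores)) / total for i in range(3))
    return lo, hi
```

In use, each panoramic viewpoint would contribute one lifted box and one confidence score from the VLM, and `fuse_boxes` would combine them into a single prediction. Averaging corners is the simplest choice here; selecting the highest-scoring view or voting over overlapping boxes would be equally plausible alternatives.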

fields

cs.CV 3

years

2026 3

verdicts

UNVERDICTED 3

representative citing papers

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

OneCanvas aggregates multi-view 3D patches onto one panoramic canvas with continuous angular placement and 3D embeddings, enabling pretrained VLMs to achieve SOTA on SQA3D and VSI-Bench with an order of magnitude less compute via a new spatial pretraining curriculum.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding · cs.CV · 2026-06-30 · unverdicted · none · ref 108 · internal anchor

    PruneGround prunes 3D scenes via language-guided VLM, reformulates descriptions with multi-view reasoning, and adapts a spatial LLM to achieve SOTA 3D visual grounding on ScanRefer and most Nr3D/Sr3D settings.

  • OneCanvas: 3D Scene Understanding via Panoramic Reprojection · cs.CV · 2026-06-17 · unverdicted · none · ref 8 · internal anchor

    OneCanvas aggregates multi-view 3D patches onto one panoramic canvas with continuous angular placement and 3D embeddings, enabling pretrained VLMs to achieve SOTA on SQA3D and VSI-Bench with an order of magnitude less compute via a new spatial pretraining curriculum.

  • Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling · cs.CV · 2026-06-26 · unverdicted · none · ref 50 · internal anchor

    Survey organizing panoramic scene analysis literature by architectural design and training paradigm, identifying the absence of methods achieving both strict spherical equivariance and full reuse of perspective-pretrained weights, plus five evaluation protocol gaps and a six-point roadmap.