arxiv: 2512.20907 · v2 · pith:R6NIKHHTnew · submitted 2025-12-24 · 💻 cs.CV

PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

classification 💻 cs.CV
keywords panoramic · vision-language · reasoning · scene · vlms · datasets · generalization · geometry

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but generalize poorly, owing to the scarcity of 3D vision-language datasets and their weaker reasoning capabilities relative to modern vision-language models (VLMs). We propose a generalizable 3DVG framework, PanoGrounder, that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints according to the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates strong generalization to unseen 3D datasets and text rephrasings.
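
To make the idea of a panorama as an intermediate 2D/3D representation concrete, here is a minimal sketch of projecting 3D points into a panorama and lifting them back. It is not the paper's implementation: the abstract does not name a projection, so an equirectangular mapping with a z-up world frame is assumed, and the function names `project_equirect`, `lift_equirect`, and `fuse_boxes` are hypothetical. The fusion step is a deliberately naive stand-in (an axis-aligned box around all lifted points) for whatever fusion the authors actually use.

```python
import numpy as np

def project_equirect(points, cam, width, height):
    """Project world points (N, 3) to equirectangular pixels plus range.

    Assumption: z is up and the panorama is centred at `cam`.
    """
    d = np.asarray(points, dtype=float) - np.asarray(cam, dtype=float)
    r = np.linalg.norm(d, axis=1)              # distance from the viewpoint
    lon = np.arctan2(d[:, 1], d[:, 0])         # azimuth in [-pi, pi]
    lat = np.arcsin(d[:, 2] / r)               # elevation in [-pi/2, pi/2]
    u = (lon + np.pi) / (2 * np.pi) * width    # column coordinate
    v = (np.pi / 2 - lat) / np.pi * height     # row coordinate (top = up)
    return u, v, r

def lift_equirect(u, v, r, cam, width, height):
    """Inverse of project_equirect: pixels plus range back to world points."""
    lon = u / width * 2 * np.pi - np.pi
    lat = np.pi / 2 - v / height * np.pi
    direction = np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=1,
    )
    return np.asarray(cam, dtype=float) + direction * r[:, None]

def fuse_boxes(point_sets):
    """Naive fusion: one axis-aligned box around points lifted from all views."""
    pts = np.concatenate(point_sets, axis=0)
    return pts.min(axis=0), pts.max(axis=0)
```

In this sketch, each panoramic viewpoint would contribute the lifted 3D points of its predicted region, and `fuse_boxes` merges them into a single box. Because the projection covers the full sphere, every point other than the camera position itself has a valid pixel location, which is the property the abstract credits for preserving long-range object-to-object relations.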

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding

    cs.CV 2026-06 unverdicted novelty 6.0

    PruneGround prunes 3D scenes via language-guided VLM, reformulates descriptions with multi-view reasoning, and adapts a spatial LLM to achieve SOTA 3D visual grounding on ScanRefer and most Nr3D/Sr3D settings.

  2. Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling

    cs.CV 2026-06 unverdicted novelty 3.0

    Survey organizing panoramic scene analysis literature by architectural design and training paradigm, identifying the absence of methods achieving both strict spherical equivariance and full reuse of perspective-pretra...