SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Brendan Crowe; Christoffer Heckman; Doncey Albin; Xuefei Sun; Xujia Zhang

arXiv: 2605.21788 · v1 · pith:EI4WX4QYnew · submitted 2026-05-20 · cs.CV · cs.RO

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

Pith reviewed 2026-05-22 08:43 UTC · model grok-4.3

keywords: zero-shot 3D visual grounding · scene graph matching · vision-language models · RGB-D inputs · robot deployment · spatial reasoning · ScanRefer benchmark

The pith

Reformulating 3D visual grounding as structured graph matching on a scene graph built from 2D views enables competitive zero-shot localization from natural language using only RGB-D inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework called SceneGraphGrounder that turns the problem of finding objects in 3D space from free-form language descriptions into a graph alignment task. It first uses a vision-language model prompted with visual markers to extract relationships between objects across multiple 2D RGB-D views, then lifts those relations into a single consistent 3D scene graph that stores both spatial positions and semantic connections. For any given language query, the method builds a corresponding query graph and finds the best constrained match inside the scene graph. A sympathetic reader would care because this yields interpretable, multi-view consistent results without requiring training on grounding-specific datasets, and the approach is shown to transfer from benchmarks to physical robot operation in extended real environments.
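The lifting step described above depends on turning a 2D observation with depth into a 3D position. The following is a minimal sketch of standard pinhole back-projection, not the paper's implementation: the function names `backproject` and `to_world`, the intrinsics, and the pose values are all hypothetical, and the pose is assumed to come from the sensor odometry as a camera-to-world rotation and translation.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame
    using the pinhole model. fx, fy, cx, cy are camera intrinsics."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(p_cam, rotation, translation):
    """Map a camera-frame point into the world frame.
    rotation is a 3x3 nested list, translation a 3-tuple (camera-to-world)."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and pose, for illustration only.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
marker_cam = backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
marker_world = to_world(marker_cam, identity, (1.0, 0.0, 0.0))
print(marker_world)
```

Once every marked object has a world-frame position, relations inferred in different views can refer to the same 3D node rather than to per-image pixel regions.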

Core claim

SceneGraphGrounder reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. A visual marker prompting strategy enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, a query graph is constructed and aligned with the scene graph under constraints that enforce multi-view consistency and interpretable reasoning.

What carries the argument

Constrained alignment between a language-derived query graph and a persistent 3D scene graph whose edges and nodes are populated by lifting VLM-inferred relations from multiple 2D RGB-D views.
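To make the idea of constrained alignment concrete, here is a small sketch of matching a query graph against a scene graph by backtracking search. It is an illustration under stated assumptions, not the authors' algorithm: the data layout (label dictionaries and relation triples), the exact-label constraint, and the function name `match_query` are all hypothetical.

```python
def match_query(scene_nodes, scene_edges, query_nodes, query_edges, target):
    """Return the scene node ids that can play `target` in some assignment
    satisfying every label and relation constraint of the query graph.

    scene_nodes: {node_id: label}
    scene_edges: set of (subject_id, relation, object_id)
    query_nodes: {variable: label}
    query_edges: list of (subject_var, relation, object_var)
    """
    variables = list(query_nodes)
    # Label constraint: a variable may only bind to nodes with the same label.
    candidates = {
        var: [n for n, label in scene_nodes.items() if label == query_nodes[var]]
        for var in variables
    }
    hits = set()

    def extend(assignment, remaining):
        if not remaining:
            hits.add(assignment[target])
            return
        var = remaining[0]
        for node in candidates[var]:
            if node in assignment.values():
                continue  # each scene node is used at most once
            trial = {**assignment, var: node}
            # Relation constraint: check every query edge whose endpoints are bound.
            if all(
                (trial[s], rel, trial[o]) in scene_edges
                for s, rel, o in query_edges
                if s in trial and o in trial
            ):
                extend(trial, remaining[1:])

    extend({}, variables)
    return sorted(hits)


# Hypothetical scene: two chairs, one table, one lamp.
scene_nodes = {0: "chair", 1: "chair", 2: "table", 3: "lamp"}
scene_edges = {(0, "next to", 2), (3, "on", 2), (1, "next to", 3)}
# Query: "the chair next to the table that has a lamp on it".
query_nodes = {"c": "chair", "t": "table", "l": "lamp"}
query_edges = [("c", "next to", "t"), ("l", "on", "t")]
print(match_query(scene_nodes, scene_edges, query_nodes, query_edges, "c"))  # → [0]
```

Because every accepted assignment is backed by explicit edges, a failed or ambiguous match can be traced to the specific relation that was missing or duplicated.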

If this is right

The method reaches competitive accuracy among zero-shot approaches on the ScanRefer benchmark while using only RGB-D sensor data.
The same pipeline supports direct deployment on a mobile robot and maintains spatial reasoning across long sequences of actions in physical space.
Reasoning remains interpretable because every alignment step operates on explicit graph edges rather than implicit feature vectors.
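For context on the first point, ScanRefer accuracy is conventionally reported as the fraction of queries whose predicted 3D box overlaps the ground-truth box above an IoU threshold (0.25 and 0.5). The sketch below assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names and example boxes are hypothetical and this is not the benchmark's official evaluation code.

```python
def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes,
    each given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return intersection / (volume(a) + volume(b) - intersection)


def accuracy_at_iou(predictions, ground_truths, threshold=0.25):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(
        box_iou_3d(p, g) >= threshold for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Hypothetical boxes: the prediction is shifted by half the box width.
predicted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
print(box_iou_3d(predicted, ground_truth))  # IoU of 1/3
```

In this example the prediction counts as correct at the 0.25 threshold and incorrect at 0.5, which is why both thresholds are usually reported side by side.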

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the graph with temporal edges could allow the same matching process to track objects across time without separate tracking modules.
The explicit graph representation may make it easier to incorporate additional constraints such as physics or commonsense rules during alignment.
Because the method separates scene construction from query matching, it could be combined with faster 3D reconstruction pipelines to reduce latency in real-time settings.

Load-bearing premise

Relationships detected by the vision-language model in separate 2D images can be lifted into one 3D scene graph that remains free of contradictions when the same objects are seen from different angles.

What would settle it

A direct test would be to check whether the constructed 3D scene graph assigns conflicting spatial or semantic relations to the same pair of objects when the input views are rotated or reordered; systematic conflicts would show the lifting step does not produce a reliable persistent representation.
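The test described above can be sketched as a scan over per-view relation observations that flags object pairs receiving opposite directional relations. This is a hypothetical illustration: the relation vocabulary in `AXES`, the observation tuple layout, and the function name `find_conflicts` are assumptions, and the paper may represent relations differently.

```python
from collections import defaultdict

# Hypothetical relation vocabulary: relation -> (axis, direction sign).
AXES = {
    "left of": ("x", -1),
    "right of": ("x", +1),
    "below": ("z", -1),
    "above": ("z", +1),
}


def find_conflicts(observations):
    """Find object pairs that receive opposite relations on the same axis.

    observations: iterable of (view_id, subject, relation, object).
    Returns {(a, b, axis): {sign: [view_ids]}} for conflicting pairs only.
    """
    votes = defaultdict(lambda: defaultdict(set))
    for view_id, subject, relation, obj in observations:
        if relation not in AXES:
            continue  # non-directional relations are not checked here
        axis, sign = AXES[relation]
        a, b = sorted((subject, obj))
        if (a, b) != (subject, obj):
            sign = -sign  # swapping the pair flips the direction
        votes[(a, b, axis)][sign].add(view_id)
    return {
        key: {sign: sorted(views) for sign, views in signs.items()}
        for key, signs in votes.items()
        if len(signs) > 1
    }


observations = [
    (0, "chair", "left of", "table"),
    (1, "table", "left of", "chair"),  # contradicts view 0
    (2, "lamp", "above", "table"),
]
print(find_conflicts(observations))
```

Running the same check after rotating or reordering the input views, and comparing the set of flagged pairs, would give one concrete measure of how stable the lifted graph is.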

Figures

Figures reproduced from arXiv: 2605.21788 by Brendan Crowe, Christoffer Heckman, Doncey Albin, Xuefei Sun, Xujia Zhang.

**Figure 1.** System Overview. Our framework takes RGB-D images, sensor odometry, and a user query as input. The graph is lifted into 3D via visual prompting and association with the reconstructed point cloud, forming a global 3D scene graph. Object grounding is achieved via graph matching with the parsed query.

**Figure 2.** Qualitative results. Left: ScanRefer. Right: real-world robot experiments. Rendered images highlight ground truth objects (…

**Figure 3.** Robot experiment setup. Left: bird’s-eye view of trajectories taken by our quadrupedal robotics platform. Right: the platform and onboard sensing suite used in our experiments.

**Figure 4.** Performance analysis on ScanRefer.

**Figure 5.** Statistical analysis of the ScanRefer test split: (a) VLM usage …

**Figure 6.** Real-robot Experiment: (a) relationship-type distribution, and …

Original abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SceneGraphGrounder, a zero-shot 3D visual grounding framework that reformulates the task as constrained graph matching between a query graph and a reconstructed 3D scene graph. The core technical contribution is a visual marker prompting strategy that elicits object-object spatial and semantic relations from a VLM on 2D RGB-D views; these relations are lifted into a persistent 3D scene graph. Given a natural-language query, the method builds a corresponding query graph and performs alignment to localize the target object. Experiments are reported on the ScanRefer benchmark (claiming competitive results among zero-shot methods using only RGB-D input) together with real-robot deployment on a mobile platform for long-horizon tasks.

Significance. If the relation-lifting step proves reliable, the explicit scene-graph formulation could improve interpretability and multi-view consistency relative to direct VLM reasoning, especially for compositional queries. The real-world robot validation is a concrete strength that demonstrates practical utility beyond simulation benchmarks. The approach also supplies a clear, modular pipeline that could be extended or ablated in future work.

major comments (2)

[Abstract] The claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.
[Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.

minor comments (2)

The manuscript states that code will be released upon acceptance; adding a footnote or repository link in the camera-ready version would improve reproducibility.
[Figure 1] Ensure that any diagram of the overall pipeline explicitly annotates the lifting and consistency-enforcement stages so readers can trace how 2D inferences become 3D relations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract] The claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.

Authors: We agree that the abstract would benefit from including specific quantitative results to support the performance claim. In the revised manuscript, we will add a concise statement with key metrics (such as the grounding accuracy on ScanRefer and direct comparison to other zero-shot RGB-D methods) along with a reference to the full baseline table and error analysis in the Experiments section. This change will make the central empirical contribution immediately verifiable from the abstract.

revision: yes

Referee: [Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.

Authors: We acknowledge that the current Method section does not explicitly describe mechanisms for handling cross-view inconsistencies in the relation-lifting process. To address this, we will expand the scene-graph construction subsection to include our conflict-resolution logic: depth-verified consistency checks across overlapping views combined with a simple voting scheme to filter view-dependent hallucinations. We will also add validation metrics (e.g., conflict resolution rate and consistency scores on held-out multi-view sequences) to empirically support the persistence and robustness claims. These additions will be placed in the revised manuscript.

revision: yes
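The rebuttal names depth-verified checks and a voting scheme without giving details, so the following is only one plausible reading, sketched under stated assumptions. The functions `vote_relation` and `depth_verified`, the `min_support` and `margin` parameters, and the z-up metric centroid convention are all hypothetical and not taken from the paper.

```python
from collections import Counter


def vote_relation(labels, min_support=2):
    """Keep the most frequent relation label across views if it has at least
    `min_support` votes and a strict majority; otherwise return None."""
    if not labels:
        return None
    top, count = Counter(labels).most_common(1)[0]
    if count >= min_support and count * 2 > len(labels):
        return top
    return None


def depth_verified(relation, subject_centroid, object_centroid, margin=0.05):
    """Check a vertical relation against 3D centroids (z up, metres).
    Relations other than above/below are passed through unchecked."""
    dz = subject_centroid[2] - object_centroid[2]
    if relation == "above":
        return dz > margin
    if relation == "below":
        return dz < -margin
    return True


# Hypothetical per-view labels for one (lamp, table) pair.
per_view = ["above", "above", "below"]
winner = vote_relation(per_view)
print(winner)  # → above
print(depth_verified(winner, (0.0, 0.0, 1.2), (0.0, 0.0, 0.7)))  # → True
```

A scheme like this would discard a relation when views split evenly or when the winning label disagrees with the reconstructed geometry, which is the kind of behaviour the requested validation metrics would need to quantify.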

Circularity Check

0 steps flagged

Reformulation as graph matching with VLM prompting shows no circular reduction to inputs

full rationale

The paper reformulates 3D visual grounding as constrained alignment between a query graph and a 3D scene graph constructed by lifting VLM-inferred relations from 2D views via visual marker prompting. No equations, derivations, or fitted parameters are presented that reduce the claimed competitive zero-shot performance on ScanRefer or the robot deployment results to self-referential definitions or by-construction predictions. The central claims rest on the external capabilities of VLMs and standard graph matching, which are treated as independent inputs rather than outputs of the method itself. No self-citation chains or uniqueness theorems are invoked to force the framework's validity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified ability of current VLMs to produce accurate, liftable object relations from 2D views and on the assumption that graph matching will remain robust in real-world long-horizon settings; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: A VLM can infer accurate object-object relationships from 2D views that lift consistently to 3D.
Invoked in the description of the visual marker prompting strategy and subsequent lifting step.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph... visual marker prompting strategy that enables a VLM to infer object–object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Z. Wang, B. Cheng, L. Zhao, D. Xu, Y . Tang, and L. Sheng, “Vl-sat: Visual-linguistic semantics assisted training for 3d semantic scene graph prediction in point cloud,” 2023. [Online]. Available: https://arxiv.org/abs/2303.14408

work page arXiv 2023

[19] [19]

Exploiting edge-oriented reasoning for 3d point-based scene graph analysis,

C. Zhang, J. Yu, Y . Song, and W. Cai, “Exploiting edge-oriented reasoning for 3d point-based scene graph analysis,” 2021. [Online]. Available: https://arxiv.org/abs/2103.05558

work page arXiv 2021

[20] [20]

Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences,

S.-C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari, “Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences,” 2021. [Online]. Available: https://arxiv.org/abs/2103.14898

work page arXiv 2021

[21] [21]

Incremental 3d semantic scene graph prediction from rgb sequences,

S.-C. Wu, K. Tateno, N. Navab, and F. Tombari, “Incremental 3d semantic scene graph prediction from rgb sequences,” 2023. [Online]. Available: https://arxiv.org/abs/2305.02743

work page arXiv 2023

[22] [22]

Exploiting contextual objects and relations for 3d visual grounding,

L. Yang, Z. Zhang, Z. Qi, Y . Xu, W. Liu, Y . Shan, B. Li, W. Yang, P. Li, Y . Wanget al., “Exploiting contextual objects and relations for 3d visual grounding,”Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024

[23] [23]

Knowledge-inspired 3d scene graph prediction in point cloud,

S. Zhang, s. li, A. Hao, and H. Qin, “Knowledge-inspired 3d scene graph prediction in point cloud,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 18 620– 18 632. [Online]. Available: https://proceedings.neurips.cc/paper files/ pap...

work page 2021

[24] [24]

3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud,

M. Feng, H. Hou, L. Zhang, Z. Wu, Y . Guo, and A. Mian, “3d spatial multimodal knowledge accumulation for scene graph prediction in point cloud,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9182–9191

work page 2023

[25] [25]

Scanrefer: 3d object localization in rgb-d scans using natural language,

D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” 2020. [Online]. Available: https://arxiv.org/abs/1912.08830

work page arXiv 2020

[26] [26]

Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,

P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,” inComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 422–440

work page 2020

[27] [27]

3d-vista: Pre-trained transformer for 3d vision and text alignment,

Z. Ziyu, M. Xiaojian, C. Yixin, D. Zhidong, H. Siyuan, and L. Qing, “3d-vista: Pre-trained transformer for 3d vision and text alignment,” in ICCV, 2023

work page 2023

[28] [28]

Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring,

Z. Yuan, X. Yan, Y . Liao, R. Zhang, S. Wang, Z. Li, and S. Cui, “Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1791–1800

work page 2021

[29] [29]

Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,

Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. M. de Melo, J. B. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull, “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” 2023. [Online]. Available: https://arxiv.org/abs/2309.16650

work page arXiv 2023

[30] [30]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

work page 2023

[31] [31]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,” 2023. [Online]. Available: https://arxiv.org/abs/2310.11441

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

3DVG-Transformer: Relation modeling for visual grounding on point clouds,

L. Zhao, D. Cai, L. Sheng, and D. Xu, “3DVG-Transformer: Relation modeling for visual grounding on point clouds,” inICCV, 2021, pp. 2928–2937

work page 2021

[33] [33]

Bottom up top down detection transformers for language grounding in images and point clouds,

A. Jain, N. Gkanatsios, I. Mediratta, and K. Fragkiadaki, “Bottom up top down detection transformers for language grounding in images and point clouds,” 2022. [Online]. Available: https://arxiv.org/abs/2112.08879

work page arXiv 2022

[34] [34]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Y . Wu, X. Cheng, R. Zhang, Z. Cheng, and J. Zhang, “Eda: Explicit text-decoupling and dense alignment for 3d visual grounding,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023, p. 19231–19242. [Online]. Available: http://dx.doi.org/10.1109/CVPR52729.2023.01843

work page doi:10.1109/cvpr52729.2023.01843 2023

[35] [35]

Gˆ 3-lq: Marrying hyperbolic alignment with explicit semantic-geometric modeling for 3d visual grounding,

Y . Wang, Y . Li, and S. Wang, “Gˆ 3-lq: Marrying hyperbolic alignment with explicit semantic-geometric modeling for 3d visual grounding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 917–13 926

work page 2024

[36] [36]

Multi-branch collaborative learning network for 3d visual grounding,

Z. Qian, Y . Ma, Z. Lin, J. Ji, X. Zheng, X. Sun, and R. Ji, “Multi-branch collaborative learning network for 3d visual grounding,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 381–398

work page 2024

[37] [37]

Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,

Z. Wang, H. Huang, Y . Zhao, L. Li, X. Cheng, Y . Zhu, A. Yin, and Z. Zhao, “Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3d visual grounding,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2662–2671

work page 2023

[38] [38]

Beyond bare queries: Open- vocabulary object grounding with 3d scene graph,

S. Linok, T. Zemskova, S. Ladanova, R. Titkov, D. Yudin, M. Monastyrny, and A. Valenkov, “Beyond bare queries: Open- vocabulary object grounding with 3d scene graph,” 2025. [Online]. Available: https://arxiv.org/abs/2406.07113

work page arXiv 2025

[39] [39]

ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” 2017. [Online]. Available: https://arxiv.org/abs/1702.04405

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

Ultralytics yolov8,

G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics yolov8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

work page 2023

[41] [41]

Qwen3 Technical Report

A. Yang, A. Liet al., “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025