SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
Pith reviewed 2026-05-22 08:43 UTC · model grok-4.3
The pith
Reformulating 3D visual grounding as structured graph matching on a scene graph built from 2D views enables competitive zero-shot localization from natural language using only RGB-D inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SceneGraphGrounder reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. A visual marker prompting strategy enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, a query graph is constructed and aligned with the scene graph under constraints that enforce multi-view consistency and interpretable reasoning.
What carries the argument
Constrained alignment between a language-derived query graph and a persistent 3D scene graph whose edges and nodes are populated by lifting VLM-inferred relations from multiple 2D RGB-D views.
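To make the alignment step concrete, here is a minimal sketch of constrained matching between a query graph and a scene graph. It is not the paper's algorithm: the toy graphs, the relation names, and the brute-force search in the hypothetical `match` function are all assumptions chosen for illustration.

```python
from itertools import permutations

# Hypothetical toy scene graph: node id -> label, plus directed relation edges.
scene_nodes = {0: "chair", 1: "table", 2: "chair", 3: "lamp"}
scene_edges = {(0, "next to", 1), (3, "on", 1), (2, "next to", 3)}

# Hypothetical query graph for "the chair next to the table with a lamp on it".
query_nodes = {"target": "chair", "a": "table", "b": "lamp"}
query_edges = [("target", "next to", "a"), ("b", "on", "a")]


def match(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every assignment of query nodes to distinct scene nodes that
    satisfies all label and relation constraints (brute-force search)."""
    names = list(query_nodes)
    results = []
    for combo in permutations(scene_nodes, len(names)):
        assign = dict(zip(names, combo))
        # Node constraint: labels must agree.
        if any(scene_nodes[assign[n]] != query_nodes[n] for n in names):
            continue
        # Edge constraint: every query relation must exist in the scene graph.
        if all((assign[s], r, assign[o]) in scene_edges for s, r, o in query_edges):
            results.append(assign)
    return results


matches = match(query_nodes, query_edges, scene_nodes, scene_edges)
print(matches)
```

Each returned assignment names the scene edges that justify it, which is the sense in which matching on explicit edges stays interpretable. A practical system would likely replace exact label equality with a similarity score and the exhaustive search with something cheaper.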
If this is right
- The method reaches competitive accuracy among zero-shot approaches on the ScanRefer benchmark while using only RGB-D sensor data.
- The same pipeline supports direct deployment on a mobile robot and maintains spatial reasoning across long sequences of actions in physical space.
- Reasoning remains interpretable because every alignment step operates on explicit graph edges rather than implicit feature vectors.
Where Pith is reading between the lines
- Extending the graph with temporal edges could allow the same matching process to track objects across time without separate tracking modules.
- The explicit graph representation may make it easier to incorporate additional constraints such as physics or commonsense rules during alignment.
- Because the method separates scene construction from query matching, it could be combined with faster 3D reconstruction pipelines to reduce latency in real-time settings.
Load-bearing premise
Relationships detected by the vision-language model in separate 2D images can be lifted into one 3D scene graph that remains free of contradictions when the same objects are seen from different angles.
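One plausible reading of "lifting" is sketched below, assuming a pinhole camera model with known intrinsics and camera poses: a 2D marker position is back-projected with its depth value, moved into the world frame, and merged with an existing node if it lands close enough. The function names, the 0.3 m merge radius, and the nearest-center association rule are assumptions, not details from the paper.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def to_world(p, R, t):
    """Apply a camera-to-world rigid transform: R is a 3x3 row-major rotation, t a translation."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def associate(point, nodes, radius=0.3):
    """Return the id of an existing node within `radius` metres, else create a new node."""
    for node_id, center in nodes.items():
        if math.dist(point, center) <= radius:
            return node_id
    new_id = len(nodes)
    nodes[new_id] = point
    return new_id


# Usage with made-up intrinsics and an identity rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
nodes = {}
p_cam = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = to_world(p_cam, identity, (1.0, 0.0, 0.0))
first = associate(p_world, nodes)
second = associate((1.1, 0.0, 2.0), nodes)  # same object seen from another view
third = associate((4.0, 0.0, 2.0), nodes)   # a different object
```

Once every 2D detection maps to a stable 3D node id, a relation inferred in any view can be stored as an edge between node ids, which is the precondition for checking whether views contradict one another.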
What would settle it
A direct test would be to check whether the constructed 3D scene graph assigns conflicting spatial or semantic relations to the same pair of objects when the input views are rotated or reordered; systematic conflicts would show the lifting step does not produce a reliable persistent representation.
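The test described above could be sketched as follows, assuming the per-view relations have already been mapped to shared object ids. The table of mutually exclusive relation pairs and the hypothetical `find_conflicts` helper are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

# Assumed pairs of relations that cannot both hold for the same ordered object pair.
EXCLUSIVE = {
    frozenset({"left of", "right of"}),
    frozenset({"above", "below"}),
    frozenset({"in front of", "behind"}),
}


def find_conflicts(observations):
    """observations: iterable of (view_id, subject_id, relation, object_id).
    Returns {(subject_id, object_id): sorted list of clashing relations}."""
    by_pair = defaultdict(set)
    for _view, subj, rel, obj in observations:
        by_pair[(subj, obj)].add(rel)
    conflicts = {}
    for pair, rels in by_pair.items():
        clash = sorted(
            r for r in rels
            if any(frozenset({r, other}) in EXCLUSIVE for other in rels if other != r)
        )
        if clash:
            conflicts[pair] = clash
    return conflicts


observations = [
    ("view1", 0, "left of", 1),
    ("view2", 0, "right of", 1),  # same pair seen from the opposite side
    ("view1", 2, "above", 1),
    ("view2", 2, "above", 1),
]
conflicts = find_conflicts(observations)
print(conflicts)
```

Because the check aggregates relations into sets, its output does not depend on the order of the views. A complementary test would rebuild the graph itself from shuffled or rotated inputs and compare the resulting edge sets.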
Original abstract
Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SceneGraphGrounder, a zero-shot 3D visual grounding framework that reformulates the task as constrained graph matching between a query graph and a reconstructed 3D scene graph. The core technical contribution is a visual marker prompting strategy that elicits object-object spatial and semantic relations from a VLM on 2D RGB-D views; these relations are lifted into a persistent 3D scene graph. Given a natural-language query, the method builds a corresponding query graph and performs alignment to localize the target object. Experiments are reported on the ScanRefer benchmark (claiming competitive results among zero-shot methods using only RGB-D input) together with real-robot deployment on a mobile platform for long-horizon tasks.
Significance. If the relation-lifting step proves reliable, the explicit scene-graph formulation could improve interpretability and multi-view consistency relative to direct VLM reasoning, especially for compositional queries. The real-world robot validation is a concrete strength that demonstrates practical utility beyond simulation benchmarks. The approach also supplies a clear, modular pipeline that could be extended or ablated in future work.
major comments (2)
- [Abstract] Abstract: the claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.
- [Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.
minor comments (2)
- The manuscript states that code will be released upon acceptance; adding a footnote or repository link in the camera-ready version would improve reproducibility.
- [Figure 1] Ensure that any diagram of the overall pipeline explicitly annotates the lifting and consistency-enforcement stages so readers can trace how 2D inferences become 3D relations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.
- Referee: [Abstract] Abstract: the claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.
  Authors: We agree that the abstract would benefit from including specific quantitative results to support the performance claim. In the revised manuscript, we will add a concise statement with key metrics (such as the grounding accuracy on ScanRefer and direct comparison to other zero-shot RGB-D methods) along with a reference to the full baseline table and error analysis in the Experiments section. This change will make the central empirical contribution immediately verifiable from the abstract.
  Revision: yes
- Referee: [Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.
  Authors: We acknowledge that the current Method section does not explicitly describe mechanisms for handling cross-view inconsistencies in the relation-lifting process. To address this, we will expand the scene-graph construction subsection to include our conflict-resolution logic: depth-verified consistency checks across overlapping views combined with a simple voting scheme to filter view-dependent hallucinations. We will also add validation metrics (e.g., conflict resolution rate and consistency scores on held-out multi-view sequences) to empirically support the persistence and robustness claims. These additions will be placed in the revised manuscript.
  Revision: yes
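The rebuttal names depth-verified checks plus a simple voting scheme but gives no details. A minimal sketch of what such a filter might look like follows; the `depth_ok` flag, the strict-majority threshold, and the hypothetical `vote_relations` function are assumptions rather than the authors' logic.

```python
from collections import Counter, defaultdict


def vote_relations(observations, min_share=0.5):
    """observations: iterable of (subject_id, relation, object_id, depth_ok).
    Keeps, per object pair, the relation with the most depth-verified votes,
    provided its share of those votes strictly exceeds `min_share`."""
    ballots = defaultdict(Counter)
    for subj, rel, obj, depth_ok in observations:
        if depth_ok:  # discard observations that failed the depth check
            ballots[(subj, obj)][rel] += 1
    edges = {}
    for pair, counts in ballots.items():
        rel, votes = counts.most_common(1)[0]
        if votes / sum(counts.values()) > min_share:
            edges[pair] = rel
    return edges


observations = [
    (0, "on", 1, True),
    (0, "on", 1, True),
    (0, "under", 1, True),
    (0, "under", 1, False),   # rejected by the depth check, not counted
    (2, "next to", 3, True),
    (2, "behind", 3, True),   # tie with "next to": no strict majority
]
edges = vote_relations(observations)
print(edges)
```

The fraction of object pairs that lose their edge under this filter is one candidate for the "conflict resolution rate" the authors mention, though the rebuttal does not define that metric.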
Circularity Check
Reformulation as graph matching with VLM prompting shows no circular reduction to inputs
The paper reformulates 3D visual grounding as constrained alignment between a query graph and a 3D scene graph constructed by lifting VLM-inferred relations from 2D views via visual marker prompting. No equations, derivations, or fitted parameters are presented that reduce the claimed competitive zero-shot performance on ScanRefer or the robot deployment results to self-referential definitions or by-construction predictions. The central claims rest on the external capabilities of VLMs and standard graph matching, which are treated as independent inputs rather than outputs of the method itself. No self-citation chains or uniqueness theorems are invoked to force the framework's validity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLM can infer accurate object-object relationships from 2D views that lift consistently to 3D
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph... visual marker prompting strategy that enables a VLM to infer object–object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.