
arxiv: 2605.21788 · v1 · pith:EI4WX4QYnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Pith reviewed 2026-05-22 08:43 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords zero-shot 3D visual grounding · scene graph matching · vision-language models · RGB-D inputs · robot deployment · spatial reasoning · ScanRefer benchmark

The pith

Reformulating 3D visual grounding as structured graph matching on a scene graph built from 2D views enables competitive zero-shot localization from natural language using only RGB-D inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SceneGraphGrounder, a framework that recasts finding objects in 3D space from free-form language descriptions as a graph alignment task. It first prompts a vision-language model with visual markers to extract relationships between objects across multiple 2D RGB-D views, then lifts those relations into a single consistent 3D scene graph that stores both spatial positions and semantic connections. For a given language query, the method builds a corresponding query graph and finds the best constrained match inside the scene graph. A sympathetic reader would care because this yields interpretable, multi-view-consistent results without training on grounding-specific datasets, and because the approach is shown to transfer from benchmarks to a physical robot operating in long-horizon real-world environments.

Core claim

SceneGraphGrounder reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. A visual marker prompting strategy enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, a query graph is constructed and aligned with the scene graph under constraints that enforce multi-view consistency and interpretable reasoning.

What carries the argument

Constrained alignment between a language-derived query graph and a persistent 3D scene graph whose edges and nodes are populated by lifting VLM-inferred relations from multiple 2D RGB-D views.
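
To make the alignment idea concrete, here is a minimal sketch of matching a query graph against a scene graph by exhaustive search. It is an illustration only, not the authors' algorithm: the function `match_query`, the node and edge encodings, and the example labels are all assumptions, and the paper's actual constraints and scoring are not described in this summary.

```python
from itertools import permutations


def match_query(scene_nodes, scene_edges, query_nodes, query_edges):
    """Return every assignment of query variables to scene object ids
    that satisfies all label and relation constraints.

    scene_nodes: dict object_id -> label
    scene_edges: set of (subject_id, relation, object_id)
    query_nodes: dict variable -> required label
    query_edges: list of (subject_var, relation, object_var)
    """
    variables = list(query_nodes)
    matches = []
    # Brute force over injective assignments; acceptable for small query graphs.
    for ids in permutations(scene_nodes, len(variables)):
        assign = dict(zip(variables, ids))
        # Node constraint: every variable must land on an object with its label.
        if any(scene_nodes[assign[v]] != query_nodes[v] for v in variables):
            continue
        # Edge constraint: every query relation must exist in the scene graph.
        if all((assign[s], r, assign[o]) in scene_edges for s, r, o in query_edges):
            matches.append(assign)
    return matches


# Hypothetical scene: two chairs, one table, one lamp.
scene_nodes = {0: "chair", 1: "chair", 2: "table", 3: "lamp"}
scene_edges = {(0, "next to", 2), (3, "on", 2), (1, "next to", 3)}

# Query: "the chair next to the table".
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = [("target", "next to", "anchor")]

matches = match_query(scene_nodes, scene_edges, query_nodes, query_edges)
print(matches)  # [{'target': 0, 'anchor': 2}]
```

Because every accepted match is backed by explicit edges, the relation that disambiguated the two chairs can be read directly from the result, which is the interpretability property the summary highlights.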

If this is right

  • The method reaches competitive accuracy among zero-shot approaches on the ScanRefer benchmark while using only RGB-D sensor data.
  • The same pipeline supports direct deployment on a mobile robot and maintains spatial reasoning across long sequences of actions in physical space.
  • Reasoning remains interpretable because every alignment step operates on explicit graph edges rather than implicit feature vectors.
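
ScanRefer results are conventionally reported as accuracy at 3D IoU thresholds of 0.25 and 0.5 between the predicted and ground-truth boxes; this summary does not state which numbers the paper reports. As a rough sketch of how such a score is computed, assuming axis-aligned boxes and hypothetical helper names:

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def accuracy_at_iou(preds, gts, threshold):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two made-up queries: one partial overlap (IoU about 0.4), one complete miss.
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts = [(0, 0, 0, 2, 2, 0.8), (5, 5, 5, 6, 6, 6)]

acc_025 = accuracy_at_iou(preds, gts, 0.25)
acc_050 = accuracy_at_iou(preds, gts, 0.5)
print(acc_025, acc_050)  # 0.5 0.0
```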

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the graph with temporal edges could allow the same matching process to track objects across time without separate tracking modules.
  • The explicit graph representation may make it easier to incorporate additional constraints such as physics or commonsense rules during alignment.
  • Because the method separates scene construction from query matching, it could be combined with faster 3D reconstruction pipelines to reduce latency in real-time settings.

Load-bearing premise

Relationships detected by the vision-language model in separate 2D images can be lifted into one 3D scene graph that remains free of contradictions when the same objects are seen from different angles.
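
One standard way to place a 2D detection in a shared 3D frame is pinhole back-projection using the depth reading and the camera pose. The sketch below shows that generic technique, not the paper's lifting procedure, which this summary does not specify; the intrinsics, poses, and the 0.2 m merge radius are made-up values.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_world(point, rotation, translation):
    """Apply a camera-to-world pose: world = R @ point + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def same_object(p, q, radius=0.2):
    """Treat two lifted points as one object if they lie within `radius` metres."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= radius ** 2


# Hypothetical intrinsics and two views that differ by a 1 m sideways shift.
fx = fy = 500.0
cx, cy = 320.0, 240.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

point_a = to_world(backproject(420, 240, 2.0, fx, fy, cx, cy), identity, (0.0, 0.0, 0.0))
point_b = to_world(backproject(170, 240, 2.0, fx, fy, cx, cy), identity, (1.0, 0.0, 0.0))

merged = same_object(point_a, point_b)
print(merged)  # True: both detections land on the same world-frame point
```

Relations inferred in separate views can only be compared once their endpoints resolve to the same 3D objects, so errors in depth or odometry at this step feed directly into the consistency question raised above.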

What would settle it

A direct test would be to check whether the constructed 3D scene graph assigns conflicting spatial or semantic relations to the same pair of objects when the input views are rotated or reordered; systematic conflicts would show the lifting step does not produce a reliable persistent representation.
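
A minimal sketch of such a check follows, assuming per-view relation triples are available and already expressed in a shared world frame. The table of mutually exclusive relations and all names are hypothetical, and the sketch does not canonicalize inverse triples such as (A, left of, B) versus (B, right of, A).

```python
from collections import defaultdict

# Hypothetical table of mutually exclusive relation pairs.
EXCLUSIVE = {("left of", "right of"), ("above", "below"), ("in front of", "behind")}


def contradicts(r1, r2):
    return (r1, r2) in EXCLUSIVE or (r2, r1) in EXCLUSIVE


def find_conflicts(observations):
    """observations: iterable of (view_id, subject_id, relation, object_id).
    Returns a dict mapping each (subject, object) pair to its clashing relations."""
    by_pair = defaultdict(set)
    for _view, subj, rel, obj in observations:
        by_pair[(subj, obj)].add(rel)
    conflicts = {}
    for pair, rels in by_pair.items():
        ordered = sorted(rels)
        clashes = [
            (a, b)
            for i, a in enumerate(ordered)
            for b in ordered[i + 1:]
            if contradicts(a, b)
        ]
        if clashes:
            conflicts[pair] = clashes
    return conflicts


observations = [
    (0, "chair", "left of", "table"),
    (1, "chair", "right of", "table"),  # disagrees with view 0
    (0, "lamp", "above", "table"),
    (1, "lamp", "above", "table"),
]

conflicts = find_conflicts(observations)
reordered = find_conflicts(observations[::-1])  # view order should not matter
print(conflicts)  # {('chair', 'table'): [('left of', 'right of')]}
```

Counting conflicting pairs over all observed pairs gives a simple conflict rate that could be tracked as views are added, rotated, or reordered.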

Figures

Figures reproduced from arXiv: 2605.21788 by Brendan Crowe, Christoffer Heckman, Doncey Albin, Xuefei Sun, Xujia Zhang.

Figure 1: System Overview. Our framework takes RGB-D images, sensor odometry, and a user query as input. The graph is lifted into 3D via visual prompting and association with the reconstructed point cloud, forming a global 3D scene graph. Object grounding is achieved via graph matching with the parsed query.
Adjacent paper text: … improve generalization to unseen objects and predicates by leveraging large VLMs and language priors. Within …
Figure 2: Qualitative results. Left: ScanRefer. Right: real-world robot experiments. Rendered images highlight ground truth objects (…
Figure 3: Robot experiment setup. Left: bird's-eye view of trajectories taken by our quadrupedal robotics platform. Right: the platform and onboard sensing suite used in our experiments.
Adjacent paper text: VI. RESULTS. A. Main Results on ScanRefer. Table I presents the performance of our method on the ScanRefer validation set under both unique and multiple object splits. We compare against a wide range of baselines, including fully supe…
Figure 4: Performance analysis on ScanRefer.
Adjacent paper text: … backbone models, our framework consistently improves performance, showing that stronger multimodal reasoning benefits 3D grounding under limited geometric cues. 1) Unique vs. Multiple Splits: A consistent trend across all methods is the significant performance gap between the unique and multiple splits. As shown in Table I, performance on unique scenes is substantially h…
Figure 6: Real-robot Experiment: (a) relationship-type distribution, and …
Figure 5: Statistical analysis of the ScanRefer test split: (a) VLM usage …
Original abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SceneGraphGrounder, a zero-shot 3D visual grounding framework that reformulates the task as constrained graph matching between a query graph and a reconstructed 3D scene graph. The core technical contribution is a visual marker prompting strategy that elicits object-object spatial and semantic relations from a VLM on 2D RGB-D views; these relations are lifted into a persistent 3D scene graph. Given a natural-language query, the method builds a corresponding query graph and performs alignment to localize the target object. Experiments are reported on the ScanRefer benchmark (claiming competitive results among zero-shot methods using only RGB-D input) together with real-robot deployment on a mobile platform for long-horizon tasks.

Significance. If the relation-lifting step proves reliable, the explicit scene-graph formulation could improve interpretability and multi-view consistency relative to direct VLM reasoning, especially for compositional queries. The real-world robot validation is a concrete strength that demonstrates practical utility beyond simulation benchmarks. The approach also supplies a clear, modular pipeline that could be extended or ablated in future work.

major comments (2)
  1. [Abstract] Abstract: the claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.
  2. [Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.
minor comments (2)
  1. The manuscript states that code will be released upon acceptance; adding a footnote or repository link in the camera-ready version would improve reproducibility.
  2. [Figure 1] Ensure that any diagram of the overall pipeline explicitly annotates the lifting and consistency-enforcement stages so readers can trace how 2D inferences become 3D relations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'competitive performance among zero-shot approaches' on ScanRefer is presented without any quantitative numbers, baseline comparisons, or error breakdown. Because the central empirical claim rests on this result, the absence of these data prevents verification of whether the graph-matching formulation actually delivers the advertised gains.

    Authors: We agree that the abstract would benefit from including specific quantitative results to support the performance claim. In the revised manuscript, we will add a concise statement with key metrics (such as the grounding accuracy on ScanRefer and direct comparison to other zero-shot RGB-D methods) along with a reference to the full baseline table and error analysis in the Experiments section. This change will make the central empirical contribution immediately verifiable from the abstract.
    revision: yes

  2. Referee: [Method (scene-graph construction)] The lifting procedure (visual marker prompting followed by 2D-to-3D relation transfer) is described as producing a 'persistent 3D scene graph' without any stated mechanism for detecting or resolving cross-view inconsistencies or depth-verified conflicts. Because VLMs are known to generate view-dependent spatial hallucinations, this step is load-bearing for the multi-view consistency and real-robot robustness claims; its correctness must be demonstrated with explicit validation metrics or conflict-resolution logic.

    Authors: We acknowledge that the current Method section does not explicitly describe mechanisms for handling cross-view inconsistencies in the relation-lifting process. To address this, we will expand the scene-graph construction subsection to include our conflict-resolution logic: depth-verified consistency checks across overlapping views combined with a simple voting scheme to filter view-dependent hallucinations. We will also add validation metrics (e.g., conflict resolution rate and consistency scores on held-out multi-view sequences) to empirically support the persistence and robustness claims. These additions will be placed in the revised manuscript.
    revision: yes
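
The simulated rebuttal names a voting scheme without specifying it. As an illustration of what per-pair majority voting across views could look like, here is a minimal sketch; the function name, the `min_views` and `min_share` thresholds, and the example data are assumptions, and the depth-verified check mentioned in the response is not modeled.

```python
from collections import Counter, defaultdict


def vote_relations(observations, min_views=2, min_share=0.6):
    """Keep, for each (subject, object) pair, the relation most views agree on.

    observations: iterable of (view_id, subject_id, relation, object_id).
    A relation survives only if at least `min_views` views report it and it
    accounts for at least `min_share` of all votes cast for that pair.
    """
    votes = defaultdict(Counter)
    for _view, subj, rel, obj in observations:
        votes[(subj, obj)][rel] += 1
    kept = {}
    for pair, counter in votes.items():
        relation, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if count >= min_views and count / total >= min_share:
            kept[pair] = relation
    return kept


observations = [
    (0, "chair", "left of", "table"),
    (1, "chair", "left of", "table"),
    (2, "chair", "left of", "table"),
    (3, "chair", "right of", "table"),  # outlier view, outvoted 3 to 1
    (0, "lamp", "on", "table"),         # seen in a single view only
]

kept = vote_relations(observations)
print(kept)  # {('chair', 'table'): 'left of'}
```

The fraction of contested pairs that end up with a surviving relation is one way to define the "conflict resolution rate" the response promises to report.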

Circularity Check

0 steps flagged

Reformulation as graph matching with VLM prompting shows no circular reduction to inputs

Full rationale

The paper reformulates 3D visual grounding as constrained alignment between a query graph and a 3D scene graph constructed by lifting VLM-inferred relations from 2D views via visual marker prompting. No equations, derivations, or fitted parameters are presented that reduce the claimed competitive zero-shot performance on ScanRefer or the robot deployment results to self-referential definitions or by-construction predictions. The central claims rest on the external capabilities of VLMs and standard graph matching, which are treated as independent inputs rather than outputs of the method itself. No self-citation chains or uniqueness theorems are invoked to force the framework's validity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified ability of current VLMs to produce accurate, liftable object relations from 2D views and on the assumption that graph matching will remain robust in real-world long-horizon settings; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: a VLM can infer accurate object-object relationships from 2D views that lift consistently to 3D
    Invoked in the description of the visual marker prompting strategy and subsequent lifting step.

pith-pipeline@v0.9.0 · 5735 in / 1357 out tokens · 39668 ms · 2026-05-22T08:43:59.574142+00:00 · methodology


