pith. machine review for the scientific record.

arxiv: 2603.08096 · v3 · submitted 2026-03-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D localization · text-guided segmentation · pose-free · feed-forward · geometry-aware attention · cross-view consistency · robotics · AR
0 comments

The pith

TrianguLang enables 3D object localization from a single text query without camera poses or optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TrianguLang, a feed-forward framework that localizes objects in 3D from natural language across multiple views. It uses predicted geometry to gate cross-view feature matches, removing semantically similar but geometrically wrong correspondences. This avoids the need for ground-truth poses or per-scene optimization that slows down prior methods. The result is state-of-the-art performance on benchmarks like ScanNet++ while running at interactive speeds. A single text query replaces the multiple clicks required by previous approaches.

Core claim

TrianguLang shows that a geometry-aware semantic attention module can enforce cross-view consistency using only the model's own predicted geometry, delivering accurate pose-free 3D localization and segmentation from text inputs.

What carries the argument

Geometry-Aware Semantic Attention (GASA), which gates cross-view correspondences with predicted geometry to suppress inconsistent matches.
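As a rough illustration of the gating idea (not the authors' implementation), the following is a minimal PyTorch sketch of cross-view attention whose logits are biased by a learned function of predicted pairwise 3D distance; the module name, tensor shapes, and the scalar-distance gate are assumptions made for exposition.

```python
# Minimal sketch (not the paper's code): cross-view attention whose logits are
# biased by a learned function of predicted 3D distance between tokens, so that
# semantically similar but geometrically distant matches can be down-weighted.
import torch
import torch.nn as nn

class GeometryGatedCrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, hidden: int = 32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Small MLP mapping a scalar distance (meters) to an additive logit bias;
        # a negative bias suppresses far-apart correspondences.
        self.gate_mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, queries, memory, query_xyz, memory_xyz):
        # queries:  (B, Nq, C) tokens from one view (or learned object queries)
        # memory:   (B, Nk, C) tokens gathered from the other views
        # *_xyz:    (B, Nq, 3) / (B, Nk, 3) predicted 3D points for each token
        B, Nq, C = queries.shape
        Nk = memory.shape[1]
        q = self.q_proj(queries).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(memory).view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(memory).view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, Nq, Nk)
        # Pairwise Euclidean distance in the model's own predicted geometry.
        dist = torch.cdist(query_xyz, memory_xyz)                    # (B, Nq, Nk)
        bias = self.gate_mlp(dist.unsqueeze(-1)).squeeze(-1)         # (B, Nq, Nk)
        logits = logits + bias.unsqueeze(1)                          # shared across heads
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
        return self.out_proj(out)
```

In this sketch, a gate that learns to emit strongly negative values for large predicted distances suppresses semantically plausible but geometrically inconsistent matches, which is the behaviour the paper attributes to GASA.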

If this is right

  • Reaches state-of-the-art accuracy in feed-forward text-guided 3D tasks on five benchmarks.
  • Operates at approximately 18 frames per second on 1008x1008 resolution images.
  • Requires no camera calibration or iterative optimization at inference time.
  • Simplifies interaction to one text query rather than O(N) manual annotations.
  • Supports practical use in robotics and AR applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This technique could be combined with improved geometry predictors to handle more challenging scenes.
  • Similar attention gating might apply to other multi-view problems like reconstruction or tracking.
  • Performance gains may diminish if geometry predictions are unreliable in certain environments.
  • The feed-forward nature opens possibilities for end-to-end training on larger datasets.

Load-bearing premise

Predicted geometry is accurate enough to suppress geometrically inconsistent semantic matches across views without ground-truth poses.

What would settle it

A dataset or scenario where the geometry predictor frequently errs would show whether the method underperforms pose-based alternatives.

Figures

Figures reproduced from arXiv: 2603.08096 by Aryeh Rothenberg, Atri Banerjee, Bryce Grant, Peng Wang.

Figure 1
Figure 1: Illustrates the TrianguLang architecture. view at source ↗
Figure 2
Figure 2: Overview of the GASA decoder. From §3.5 (3D Localization): beyond 2D segmentation, TrianguLang directly predicts 3D object centroids via mask-weighted depth unprojection; for each view $i$, $\mathbf{c}_i = \frac{\sum_{u,v} \hat{M}_i(u,v)\,\mathbf{P}_i(u,v)}{\sum_{u,v} \hat{M}_i(u,v) + \epsilon}$ (Eq. 3). A code sketch of this unprojection follows the figure list. view at source ↗
Figure 3
Figure 3: Performance on the uCO3D and ScanNet++ datasets. Left to right: RGB, depth map, ground truth, TrianguLang masks. Protocol: following MV-SAM [13], evaluation uses 100 frames per scene with 5 randomly sampled objects (excluding structural classes: wall, floor, ceiling); mIoU (mean Intersection-over-Union) and mAcc (mean per-class accuracy) are reported, averaged across 5 random seeds for object sampling. Baselines: W… view at source ↗
Figure 4
Figure 4: Qualitative comparison on LERF-OVS scenes using uniform clip thresholds. Row 1, “toaster” query: LERF and LangSplatV2 produce diffuse activations across the scene while TrianguLang tightly focuses its relevancy map on the target object. Row 3, “stripes” query: TrianguLang achieves precise localization despite not training on this dataset, and runs 3 orders of magnitude faster (∼58 ms vs. 10 to 45 min). comp… view at source ↗
Figure 5
Figure 5: Spatial disambiguation on the NVOS T-Rex scene. Top: the query “dino” (97.6% IoU) segments the dominant triceratops skull in the scene. Bottom: the query “leftmost dino” (95.8% IoU) leverages spatial reasoning to disambiguate between the two skulls, correctly selecting only the left specimen. The depth map (second column) provides the geometric context that enables this: TrianguLang computes 3D centroids f… view at source ↗
Figure 6
Figure 6: Segmentation results on the SPIn-NeRF room scene (90.7% mean IoU). Each row shows a different viewpoint: RGB input, DA3 depth estimate, ground truth mask, predicted mask, and overlay. When queried for “table,” TrianguLang produces a clean segmentation of the table surface without including the conference equipment (microphones) present in the ground truth annotation, demonstrating learned semantic boundar… view at source ↗
Figure 7
Figure 7: TSDF mesh reconstructions from TrianguLang segmentations on ScanNet++ scenes. Left: “sofa chair,” Center: “coffee table,” Right: “TV.” Each mesh is extracted by fusing masked metric depth maps across 8 views using TSDF integration. The clean geometry demonstrates that TrianguLang produces view-consistent segmentations suitable for 3D reconstruction. view at source ↗
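The Figure 2 caption carries the paper's Eq. (3), a mask-weighted average of the predicted point map. Here is a minimal NumPy sketch of that unprojection; the function name and array shapes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Eq. (3) from the Figure 2 caption: the per-view 3D centroid as a
# mask-weighted average of the predicted point map. Names and shapes are illustrative.
import numpy as np

def masked_centroid(mask: np.ndarray, points: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """mask: (H, W) soft segmentation scores in [0, 1]; points: (H, W, 3) predicted
    3D points P_i(u, v); returns the centroid c_i in the same coordinate frame."""
    weights = mask[..., None]                       # (H, W, 1)
    return (weights * points).sum(axis=(0, 1)) / (mask.sum() + eps)

# Example: a toy 2x2 view where only the top-left pixel is confidently segmented.
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
points = np.arange(12, dtype=float).reshape(2, 2, 3)  # fake point map
print(masked_centroid(mask, points))                  # ~[0., 1., 2.]
```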
read the original abstract

Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically-plausible but geometrically-inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at 1008x1008 resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TrianguLang, a feed-forward neural framework for text-guided 3D object and part localization that operates without camera poses or per-scene optimization at inference. It proposes Geometry-Aware Semantic Attention (GASA) to gate cross-view feature correspondences using the model's own predicted geometry, thereby suppressing semantically plausible but geometrically inconsistent matches. The method is evaluated on five benchmarks (including ScanNet++ and uCO3D), reporting state-of-the-art feed-forward performance while achieving ~18 FPS inference at 1008x1008 resolution; code and checkpoints are released.

Significance. If the central GASA mechanism proves reliable, the work would meaningfully advance practical text-driven 3D localization for robotics and AR by eliminating the need for poses or iterative optimization, reducing user input to a single text query. The release of code and checkpoints strengthens reproducibility.

major comments (3)
  1. [Method] Method section (GASA description): The gating mechanism assumes the model's predicted geometry is sufficiently accurate to reliably reject inconsistent cross-view matches, yet no quantitative geometry error metrics (e.g., depth or pose prediction accuracy on the benchmarks) or ablation removing the geometry gate are provided; this leaves the load-bearing claim unverified.
  2. [Experiments] Experiments and results: SOTA numbers are reported on five benchmarks, but the absence of ablations isolating the contribution of GASA (versus standard attention or independent per-view processing) and of error analysis on geometry predictions makes it impossible to confirm that the geometry-aware gating improves rather than harms correspondence quality.
  3. [Abstract] Abstract and §4: The performance claims rest on supervised training with external benchmarks, but no analysis shows that the self-supervised geometry predictions generalize to the point where gating is robust without ground-truth poses, which is the key differentiator from prior pose-free methods.
minor comments (2)
  1. [Figures] Figure captions and notation: The description of GASA could clarify the exact form of the geometry prediction head and how the gating threshold is set or learned.
  2. [Related Work] Related work: A brief comparison table with recent feed-forward localization methods would help situate the contribution more clearly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications on the existing manuscript content and committing to targeted additions that will strengthen the presentation of the GASA mechanism and its empirical validation.

read point-by-point responses
  1. Referee: [Method] Method section (GASA description): The gating mechanism assumes the model's predicted geometry is sufficiently accurate to reliably reject inconsistent cross-view matches, yet no quantitative geometry error metrics (e.g., depth or pose prediction accuracy on the benchmarks) or ablation removing the geometry gate are provided; this leaves the load-bearing claim unverified.

    Authors: We agree that explicit quantitative validation of the predicted geometry would make the load-bearing role of GASA clearer. Although the manuscript emphasizes end-to-end localization metrics, we will add depth prediction error statistics (e.g., absolute relative error on ScanNet++) and an ablation that disables the geometry gate while keeping all other components fixed. These additions will appear in a new subsection of §3 and an expanded Table 2. revision: yes

  2. Referee: [Experiments] Experiments and results: SOTA numbers are reported on five benchmarks, but the absence of ablations isolating the contribution of GASA (versus standard attention or independent per-view processing) and of error analysis on geometry predictions makes it impossible to confirm that the geometry-aware gating improves rather than harms correspondence quality.

    Authors: We acknowledge the value of isolating GASA's contribution. In the revision we will insert a dedicated ablation study that compares (i) the full TrianguLang model, (ii) a variant using standard cross-view attention without geometry gating, and (iii) independent per-view processing. We will also report geometry prediction errors alongside the localization metrics on all five benchmarks to demonstrate that the gate improves rather than degrades correspondence quality. revision: yes

  3. Referee: [Abstract] Abstract and §4: The performance claims rest on supervised training with external benchmarks, but no analysis shows that the self-supervised geometry predictions generalize to the point where gating is robust without ground-truth poses, which is the key differentiator from prior pose-free methods.

    Authors: The geometry branch is trained with a self-supervised multi-view consistency loss and receives no ground-truth poses or depth at inference; the strong feed-forward results across diverse benchmarks already serve as evidence of generalization. Nevertheless, we will add a short analysis subsection in §4 that quantifies how well the predicted geometry supports gating (e.g., fraction of suppressed matches that would have been incorrect) and includes qualitative visualizations of gated versus ungated correspondences on held-out scenes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; GASA uses learned geometry predictions from supervised training on external benchmarks without definitional reduction or fitted-input renaming.

full rationale

The paper trains a feed-forward model end-to-end on standard external benchmarks (ScanNet++, uCO3D) using supervised losses. GASA gates correspondences with the model's own predicted geometry, but this is an architectural choice whose quality is measured against held-out ground truth rather than being true by construction. No equation equates a claimed performance metric to a parameter fitted from the same quantity, and no self-citation chain supplies the central uniqueness or ansatz. The derivation therefore remains self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning training assumptions plus the domain assumption that predicted geometry suffices for gating; no new physical entities are introduced.

free parameters (1)
  • neural network weights
    Large number of parameters fitted during training on the cited benchmarks.
axioms (1)
  • domain assumption: Predicted geometry from the network is sufficiently accurate to suppress geometrically inconsistent matches
    Invoked as the core operation of GASA without ground-truth poses.

pith-pipeline@v0.9.0 · 5501 in / 1219 out tokens · 69311 ms · 2026-05-15T15:08:49.402786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 8 internal anchors

  1. [1]

    Cai, Z., Liu, S., Wang, G., Ge, Z., Zhang, X., Huang, D.: Align-detr: Enhancing end-to-end object detection with aligned loss (2024), https://arxiv.org/abs/2304.07527

  2. [2]

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  3. [3]

    Cen, J., Fang, J., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment any 3d gaussians (2025),https://arxiv.org/abs/2312.00860

  4. [4]

    Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment anything in 3d with radiance fields (2024),https://arxiv.org/abs/2304.12308

  5. [5]

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Driess, D., Florence, P., Sadigh, D., Guibas, L., Xia, F.: Spatialvlm: Endowing vision-language models with spatial reasoning capabilities (2024),https://arxiv.org/abs/2401.12168

  6. [6]

    Chen, D.Z., Chang, A.X., Nießner, M.: Scanrefer: 3d object localization in rgb-d scans using natural language (2020),https://arxiv.org/abs/1912.08830

  7. [7]

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision language models (2024), https://arxiv.org/abs/2406.01584

  8. [8]

    Choi, S., Song, H., Kim, J., Kim, T., Do, H.: Click-gaussian: Interactive segmentation to any 3d gaussians (2024), https://arxiv.org/abs/2407.11793

  9. [9]

    Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Gao, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree (2024),https://arxiv.org/abs/2410.16268

  10. [10]

    He, S., Jie, G., Wang, C., Zhou, Y., Hu, S., Li, G., Ding, H.: Refersplat: Referring segmentation in 3d gaussian splatting (2025),https://arxiv.org/abs/2508.08252

  11. [11]

    Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3d-llm: Injecting the 3d world into large language models (2023), https://arxiv.org/abs/2307.12981

  12. [12]

    Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.C., Jia, B., Huang, S.: An embodied generalist agent in 3d world (2024), https://arxiv.org/abs/2311.12871

  13. [13]

    Jeong, Y., Sun, C., Wang, Y.C.F., Cho, M., Choe, J.: Mv-sam: Multi-view promptable segmentation using pointmap guidance (2026), https://arxiv.org/abs/2601.17866

  14. [14]

    Kang, R., Chen, H., Gkioxari, G., Perona, P.: Linear mechanisms for spatiotemporal reasoning in vision language models (2026),https://arxiv.org/abs/2601.12626

  15. [15]

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: Mapanything: Universal feed-forward metric 3d reconstruction (2026),https://arxiv.org/abs/2509.13414

  16. [16]

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering (2023),https://arxiv.org/abs/2308.04079

  17. [17]

    Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields (2023),https://arxiv.org/abs/2303.09553

  18. [18]

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023),https://arxiv.org/abs/2304.02643

  19. [19]

    Kuang, L., Velikova, Y., Saleh, M., Zaech, J.N., Paudel, D.P., Busam, B.: Conceptpose: Training-free zero-shot object pose estimation using concept vectors (2025), https://arxiv.org/abs/2512.09056

  20. [20]

    Lee, T., Wen, B., Kang, M., Kang, G., Kweon, I.S., Yoon, K.J.: Any6d: Model-free 6d pose estimation of novel objects (2025),https://arxiv.org/abs/2503.18673

  21. [21]

    Li, Q., Sun, J., An, L., Su, Z., Zhang, H., Liu, Y.: Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields (2025), https://arxiv.org/abs/2506.09565

  22. [22]

    Li, W., Zhao, Y., Qin, M., Liu, Y., Cai, Y., Gan, C., Pfister, H.: Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps (2025), https://arxiv.org/abs/2507.07136

  23. [23]

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views (2025), https://arxiv.org/abs/2511.10647

  24. [24]

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., Li, X., Sun, X., Ashok, R., Mukherjee, A., Kang, H., Kong, X., Hua, G., Zhang, T., Benes, B., Bera, A.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision (2023),https://arxiv.org/abs/2312.16256

  25. [25]

    Liu, X., Tayal, P., Wang, J., Zarzar, J., Monnier, T., Tertikas, K., Duan, J., Toisoul, A., Zhang, J.Y., Neverova, N., Vedaldi, A., Shapovalov, R., Novotny, D.: Uncommon objects in 3d (2025),https://arxiv.org/abs/2501.07574

  26. [26]

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis (2020), https://arxiv.org/abs/2003.08934

  27. [27]

    Mirzaei, A., Aumentado-Armstrong, T., Derpanis, K.G., Kelly, J., Brubaker, M.A., Gilitschenski, I., Levinshtein, A.: Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields (2023),https://arxiv.org/abs/2211.12254

  28. [28]

    Miyato, T., Jaeger, B., Welling, M., Geiger, A.: Gta: A geometry-aware attention mechanism for multi-view transformers (2024), https://arxiv.org/abs/2310.10375

  29. [29]

    Nguyen, P.D.A., Ngo, T.D., Kalogerakis, E., Gan, C., Tran, A., Pham, C., Nguyen, K.: Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance (2024),https://arxiv.org/abs/2312.10671

  30. [30]

    Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.: Openscene: 3d scene understanding with open vocabularies (2023), https://arxiv.org/abs/2211.15654

  31. [31]

    Peng, Y., Wang, C., Wang, X., Lu, Y., Wang, J., Fu, Y.: Gags: Granularity-aware feature distillation for language gaussian splatting (2024)

  32. [32]

    Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting (2024),https://arxiv.org/abs/2312.16084

  33. [33]

    Qu, Y., Dai, S., Li, X., Lin, J., Cao, L., Zhang, S., Ji, R.: Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. In: Proceedings of the 32nd ACM International Conference on Multimedia (2024)

  34. [34]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020

  35. [35]

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024),https://arxiv.org/abs/2408.00714

  36. [36]

    Ren, Z., Agarwala, A., Russell, B., Schwing, A.G., Wang, O.: Neural volumetric object selection (2022),https://arxiv.org/abs/2205.14929

  37. [37]

    Segre, L., Hirschorn, O., Avidan, S.: Multi-view foundation models (2025), https://arxiv.org/abs/2512.15708

  38. [38]

    Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3d gaussians for open-vocabulary scene understanding (2023),https://arxiv.org/abs/2311.18482

  39. [39]

    Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation (2023), https://arxiv.org/abs/2306.13631

  40. [40]

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer (2025), https://arxiv.org/abs/2503.11651

  41. [41]

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy (2024),https://arxiv.org/abs/2312.14132

  42. [42]

    Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: π3: Permutation-equivariant visual geometry learning (2025), https://arxiv.org/abs/2507.13347

  43. [43]

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects (2024), https://arxiv.org/abs/2312.08344

  44. [44]

    Yang, J., Chen, X., Qian, S., Madaan, N., Iyengar, M., Fouhey, D.F., Chai, J.: Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent (2023),https://arxiv.org/abs/2309.12311

  45. [45]

    Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: Segment and edit anything in 3d scenes (2024),https://arxiv.org/abs/2312.00732

  46. [46]

    Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes (2023),https://arxiv.org/abs/2308.11417

  47. [47]

    Ying, H., Yin, Y., Zhang, J., Wang, F., Yu, T., Huang, R., Fang, L.: Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning (2023), https://arxiv.org/abs/2311.11666

  48. [48]

    Yu, J., Hari, K., Srinivas, K., El-Refai, K., Rashid, A., Kim, C.M., Kerr, J., Cheng, R., Irshad, M.Z., Balakrishna, A., Kollar, T., Goldberg, K.: Language-embedded gaussian splats (legs): Incrementally building room-scale representations with a mobile robot (2024),https://arxiv.org/abs/2409.18108

  49. [49]

    Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding (2024), https://arxiv.org/abs/2401.01970
    Appendix A.1 (GASA Decoder Layer): Each of the 6 GASA decoder layers processes queries through four sequential operations:

  50. [50]

    Self-Attention: 3 learnable queries attend to each other via standard multi-head attention (8 heads, dimension 32 per head)

  51. [51]

    Text Cross-Attention: Queries attend to text embeddings from SAM3's text encoder, applied at every layer following SAM3's design

  52. [52]

    GASA Cross-Attention: Queries attend to encoder memory with geometric bias (Eq. 2)

  53. [53]

    Feed-Forward Network: Two-layer MLP with GELU activation (256 → 2048 → 256). Appendix A.2 (Distance Kernel): The geometric kernel $\phi$ in GASA is a small MLP: $\phi(d) = \mathbf{w}_2^{\top} \cdot \mathrm{ReLU}(\mathbf{w}_1 d + \mathbf{b}_1) + b_2$ (5), with $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^{32}$. The input distance $d$ is in meters (from DA3 metric depth), providing a natural scale for indoor scenes. We initialize $b_2 = -1$ to encourage suppression of distant matc… (a short code sketch of this kernel appears after this list)

  54. [54]

    Compute spatial context: For each visible instance of the target class, compute the 2D centroid (cx, cy) (normalized to [0, 1]) and depth d at the centroid from the DA3 depth map

  55. [55]

    Determine true qualifiers: Compare the target instance against all same-class instances. A qualifier is valid only if it is true: e.g., “nearest” requires $d_{\text{target}} \le \min(d_{\text{others}}) + \epsilon$

  56. [56]

    Augment with probability p: With probability p = 0.3, randomly select one valid qualifier and prepend it to the text prompt (e.g., “chair” → “nearest chair”). This ensures the model only trains on correct spatial associations, preventing confusion from false labels. Multi-instance filtering: when spatial_multi_instance_only is enabled, augmentation is skipp…
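The appendix excerpts in items [49]-[53] describe the GASA decoder layer and its distance kernel (Eq. 5). As referenced in item [53], here is a minimal, hypothetical PyTorch sketch of that kernel; the hidden width of 32 and the $b_2 = -1$ initialization come from the excerpt above, while the class name and the way the bias is added to attention logits are assumptions.

```python
# Minimal sketch (assumptions marked) of the appendix's distance kernel, Eq. (5):
# phi(d) = w2^T · ReLU(w1 d + b1) + b2, with w1, w2 in R^32 and d in meters.
import torch
import torch.nn as nn

class DistanceKernel(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)   # w1, b1
        self.fc2 = nn.Linear(hidden, 1)   # w2, b2
        # Per the appendix: b2 is initialized to -1 to encourage suppression of distant matches.
        nn.init.constant_(self.fc2.bias, -1.0)

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # d: pairwise distances in meters (e.g. derived from DA3 metric depth);
        # returns an additive attention bias of the same shape.
        return self.fc2(torch.relu(self.fc1(d.unsqueeze(-1)))).squeeze(-1)

# Illustrative use as the geometric bias of item [52] (the wiring is an assumption):
#   logits = q @ k.transpose(-2, -1) / scale + phi(pairwise_distance)
phi = DistanceKernel()
print(phi(torch.tensor([0.1, 1.0, 5.0])))  # larger distances get a more negative bias after training
```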