pith. machine review for the scientific record.

arxiv: 2603.08096 · v3 · submitted 2026-03-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D localization · text-guided segmentation · pose-free · feed-forward · geometry-aware attention · cross-view consistency · robotics · AR
0 comments

The pith

TrianguLang enables 3D object localization from a single text query without camera poses or optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TrianguLang, a feed-forward framework that localizes objects in 3D from natural language across multiple views. It uses predicted geometry to gate cross-view feature matches, removing semantically similar but geometrically wrong correspondences. This avoids the need for ground-truth poses or per-scene optimization that slows down prior methods. The result is state-of-the-art performance on benchmarks like ScanNet++ while running at interactive speeds. A single text query replaces the multiple clicks required by previous approaches.

Core claim

TrianguLang shows that a geometry-aware semantic attention module can enforce cross-view consistency using only the model's own predicted geometry, delivering accurate pose-free 3D localization and segmentation from text inputs.

What carries the argument

Geometry-Aware Semantic Attention (GASA), which gates cross-view correspondences with predicted geometry to suppress inconsistent matches.
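As a rough illustration of the gating idea (not the authors' implementation), the following is a minimal PyTorch sketch of cross-view attention whose logits are biased by a learned function of predicted pairwise 3D distance; the module name, tensor shapes, and the scalar-distance gate are assumptions made for exposition.

```python
# Minimal sketch (not the paper's code): cross-view attention whose logits are
# biased by a learned function of predicted 3D distance between tokens, so that
# semantically similar but geometrically distant matches can be down-weighted.
import torch
import torch.nn as nn

class GeometryGatedCrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, hidden: int = 32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Small MLP mapping a scalar distance (meters) to an additive logit bias;
        # a negative bias suppresses far-apart correspondences.
        self.gate_mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, queries, memory, query_xyz, memory_xyz):
        # queries:  (B, Nq, C) tokens from one view (or learned object queries)
        # memory:   (B, Nk, C) tokens gathered from the other views
        # *_xyz:    (B, Nq, 3) / (B, Nk, 3) predicted 3D points for each token
        B, Nq, C = queries.shape
        Nk = memory.shape[1]
        q = self.q_proj(queries).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(memory).view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(memory).view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, Nq, Nk)
        # Pairwise Euclidean distance in the model's own predicted geometry.
        dist = torch.cdist(query_xyz, memory_xyz)                    # (B, Nq, Nk)
        bias = self.gate_mlp(dist.unsqueeze(-1)).squeeze(-1)         # (B, Nq, Nk)
        logits = logits + bias.unsqueeze(1)                          # shared across heads
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, C)
        return self.out_proj(out)
```

In this sketch, a gate that learns to emit strongly negative values for large predicted distances suppresses semantically plausible but geometrically inconsistent matches, which is the behaviour the paper attributes to GASA.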

If this is right

  • Reaches state-of-the-art accuracy in feed-forward text-guided 3D tasks on five benchmarks.
  • Operates at approximately 18 frames per second on 1008x1008 resolution images.
  • Requires no camera calibration or iterative optimization at inference time.
  • Simplifies interaction to one text query rather than O(N) manual annotations.
  • Supports practical use in robotics and AR applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This technique could be combined with improved geometry predictors to handle more challenging scenes.
  • Similar attention gating might apply to other multi-view problems like reconstruction or tracking.
  • Performance gains may diminish if geometry predictions are unreliable in certain environments.
  • The feed-forward nature opens possibilities for end-to-end training on larger datasets.

Load-bearing premise

Predicted geometry is accurate enough to suppress geometrically inconsistent semantic matches across views without ground-truth poses.

What would settle it

A dataset or scenario where the geometry predictor frequently errs would show whether the method underperforms pose-based alternatives.

Figures

Figures reproduced from arXiv: 2603.08096 by Aryeh Rothenberg, Atri Banerjee, Bryce Grant, Peng Wang.

Figure 1
Figure 1: Illustrates the TrianguLang architecture. view at source ↗
Figure 2
Figure 2: Overview of the GASA decoder. From §3.5 (3D Localization): beyond 2D segmentation, TrianguLang directly predicts 3D object centroids via mask-weighted depth unprojection; for each view $i$, $\mathbf{c}_i = \frac{\sum_{u,v} \hat{M}_i(u,v)\,\mathbf{P}_i(u,v)}{\sum_{u,v} \hat{M}_i(u,v) + \epsilon}$ (Eq. 3). A code sketch of this unprojection follows the figure list. view at source ↗
Figure 3
Figure 3: Performance on the uCO3D and ScanNet++ datasets. Left to right: RGB, depth map, ground truth, TrianguLang masks. Protocol: following MV-SAM [13], evaluation uses 100 frames per scene with 5 randomly sampled objects (excluding structural classes: wall, floor, ceiling); mIoU (mean Intersection-over-Union) and mAcc (mean per-class accuracy) are reported, averaged across 5 random seeds for object sampling. Baselines: W… view at source ↗
Figure 4
Figure 4: Qualitative comparison on LERF-OVS scenes using uniform clip thresholds. Row 1, “toaster” query: LERF and LangSplatV2 produce diffuse activations across the scene while TrianguLang tightly focuses its relevancy map on the target object. Row 3, “stripes” query: TrianguLang achieves precise localization despite not training on this dataset, and runs 3 orders of magnitude faster (∼58 ms vs. 10 to 45 min). comp… view at source ↗
Figure 5
Figure 5: Spatial disambiguation on the NVOS T-Rex scene. Top: the query “dino” (97.6% IoU) segments the dominant triceratops skull in the scene. Bottom: the query “leftmost dino” (95.8% IoU) leverages spatial reasoning to disambiguate between the two skulls, correctly selecting only the left specimen. The depth map (second column) provides the geometric context that enables this: TrianguLang computes 3D centroids f… view at source ↗
Figure 6
Figure 6: Segmentation results on the SPIn-NeRF room scene (90.7% mean IoU). Each row shows a different viewpoint: RGB input, DA3 depth estimate, ground truth mask, predicted mask, and overlay. When queried for “table,” TrianguLang produces a clean segmentation of the table surface without including the conference equipment (microphones) present in the ground truth annotation, demonstrating learned semantic boundar… view at source ↗
Figure 7
Figure 7: TSDF mesh reconstructions from TrianguLang segmentations on ScanNet++ scenes. Left: “sofa chair,” Center: “coffee table,” Right: “TV.” Each mesh is extracted by fusing masked metric depth maps across 8 views using TSDF integration. The clean geometry demonstrates that TrianguLang produces view-consistent segmentations suitable for 3D reconstruction. view at source ↗
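The Figure 2 caption carries the paper's Eq. (3), a mask-weighted average of the predicted point map. Here is a minimal NumPy sketch of that unprojection; the function name and array shapes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Eq. (3) from the Figure 2 caption: the per-view 3D centroid as a
# mask-weighted average of the predicted point map. Names and shapes are illustrative.
import numpy as np

def masked_centroid(mask: np.ndarray, points: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """mask: (H, W) soft segmentation scores in [0, 1]; points: (H, W, 3) predicted
    3D points P_i(u, v); returns the centroid c_i in the same coordinate frame."""
    weights = mask[..., None]                       # (H, W, 1)
    return (weights * points).sum(axis=(0, 1)) / (mask.sum() + eps)

# Example: a toy 2x2 view where only the top-left pixel is confidently segmented.
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
points = np.arange(12, dtype=float).reshape(2, 2, 3)  # fake point map
print(masked_centroid(mask, points))                  # ~[0., 1., 2.]
```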
read the original abstract

Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically-plausible but geometrically-inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at 1008x1008 resolution in $\sim$57ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TrianguLang, a feed-forward neural framework for text-guided 3D object and part localization that operates without camera poses or per-scene optimization at inference. It proposes Geometry-Aware Semantic Attention (GASA) to gate cross-view feature correspondences using the model's own predicted geometry, thereby suppressing semantically plausible but geometrically inconsistent matches. The method is evaluated on five benchmarks (including ScanNet++ and uCO3D), reporting state-of-the-art feed-forward performance while achieving ~18 FPS inference at 1008x1008 resolution; code and checkpoints are released.

Significance. If the central GASA mechanism proves reliable, the work would meaningfully advance practical text-driven 3D localization for robotics and AR by eliminating the need for poses or iterative optimization, reducing user input to a single text query. The release of code and checkpoints strengthens reproducibility.

major comments (3)
  1. [Method] Method section (GASA description): The gating mechanism assumes the model's predicted geometry is sufficiently accurate to reliably reject inconsistent cross-view matches, yet no quantitative geometry error metrics (e.g., depth or pose prediction accuracy on the benchmarks) or ablation removing the geometry gate are provided; this leaves the load-bearing claim unverified.
  2. [Experiments] Experiments and results: SOTA numbers are reported on five benchmarks, but the absence of ablations isolating the contribution of GASA (versus standard attention or independent per-view processing) and of error analysis on geometry predictions makes it impossible to confirm that the geometry-aware gating improves rather than harms correspondence quality.
  3. [Abstract] Abstract and §4: The performance claims rest on supervised training with external benchmarks, but no analysis shows that the self-supervised geometry predictions generalize to the point where gating is robust without ground-truth poses, which is the key differentiator from prior pose-free methods.
minor comments (2)
  1. [Figures] Figure captions and notation: The description of GASA could clarify the exact form of the geometry prediction head and how the gating threshold is set or learned.
  2. [Related Work] Related work: A brief comparison table with recent feed-forward localization methods would help situate the contribution more clearly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications on the existing manuscript content and committing to targeted additions that will strengthen the presentation of the GASA mechanism and its empirical validation.

read point-by-point responses
  1. Referee: [Method] Method section (GASA description): The gating mechanism assumes the model's predicted geometry is sufficiently accurate to reliably reject inconsistent cross-view matches, yet no quantitative geometry error metrics (e.g., depth or pose prediction accuracy on the benchmarks) or ablation removing the geometry gate are provided; this leaves the load-bearing claim unverified.

    Authors: We agree that explicit quantitative validation of the predicted geometry would make the load-bearing role of GASA clearer. Although the manuscript emphasizes end-to-end localization metrics, we will add depth prediction error statistics (e.g., absolute relative error on ScanNet++) and an ablation that disables the geometry gate while keeping all other components fixed. These additions will appear in a new subsection of §3 and an expanded Table 2. revision: yes

  2. Referee: [Experiments] Experiments and results: SOTA numbers are reported on five benchmarks, but the absence of ablations isolating the contribution of GASA (versus standard attention or independent per-view processing) and of error analysis on geometry predictions makes it impossible to confirm that the geometry-aware gating improves rather than harms correspondence quality.

    Authors: We acknowledge the value of isolating GASA's contribution. In the revision we will insert a dedicated ablation study that compares (i) the full TrianguLang model, (ii) a variant using standard cross-view attention without geometry gating, and (iii) independent per-view processing. We will also report geometry prediction errors alongside the localization metrics on all five benchmarks to demonstrate that the gate improves rather than degrades correspondence quality. revision: yes

  3. Referee: [Abstract] Abstract and §4: The performance claims rest on supervised training with external benchmarks, but no analysis shows that the self-supervised geometry predictions generalize to the point where gating is robust without ground-truth poses, which is the key differentiator from prior pose-free methods.

    Authors: The geometry branch is trained with a self-supervised multi-view consistency loss and receives no ground-truth poses or depth at inference; the strong feed-forward results across diverse benchmarks already serve as evidence of generalization. Nevertheless, we will add a short analysis subsection in §4 that quantifies how well the predicted geometry supports gating (e.g., fraction of suppressed matches that would have been incorrect) and includes qualitative visualizations of gated versus ungated correspondences on held-out scenes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; GASA uses learned geometry predictions from supervised training on external benchmarks without definitional reduction or fitted-input renaming.

full rationale

The paper trains a feed-forward model end-to-end on standard external benchmarks (ScanNet++, uCO3D) using supervised losses. GASA gates correspondences with the model's own predicted geometry, but this is an architectural choice whose quality is measured against held-out ground truth rather than being true by construction. No equation equates a claimed performance metric to a parameter fitted from the same quantity, and no self-citation chain supplies the central uniqueness or ansatz. The derivation therefore remains self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard deep-learning training assumptions plus the domain assumption that predicted geometry suffices for gating; no new physical entities are introduced.

free parameters (1)
  • neural network weights
    Large number of parameters fitted during training on the cited benchmarks.
axioms (1)
  • domain assumption: Predicted geometry from the network is sufficiently accurate to suppress geometrically inconsistent matches
    Invoked as the core operation of GASA without ground-truth poses.

pith-pipeline@v0.9.0 · 5501 in / 1219 out tokens · 69311 ms · 2026-05-15T15:08:49.402786+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 8 internal anchors

  1. [1]

    Cai, Z., Liu, S., Wang, G., Ge, Z., Zhang, X., Huang, D.: Align-detr: Enhancing end-to-end object detection with aligned loss (2024), https://arxiv.org/abs/2304.07527

  2. [2]

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  3. [3]

    Cen, J., Fang, J., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment any 3d gaussians (2025),https://arxiv.org/abs/2312.00860

  4. [4]

    Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment anything in 3d with radiance fields (2024),https://arxiv.org/abs/2304.12308

  5. [5]

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Driess, D., Florence, P., Sadigh, D., Guibas, L., Xia, F.: Spatialvlm: Endowing vision-language models with spatial reasoning capabilities (2024),https://arxiv.org/abs/2401.12168

  6. [6]

    Chen, D.Z., Chang, A.X., Nießner, M.: Scanrefer: 3d object localization in rgb-d scans using natural language (2020),https://arxiv.org/abs/1912.08830

  7. [7]

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision language models (2024), https://arxiv.org/abs/2406.01584

  8. [8]

    Choi, S., Song, H., Kim, J., Kim, T., Do, H.: Click-gaussian: Interactive segmentation to any 3d gaussians (2024), https://arxiv.org/abs/2407.11793

  9. [9]

    Ding, S., Qian, R., Dong, X., Zhang, P., Zang, Y., Cao, Y., Gao, Y., Lin, D., Wang, J.: Sam2long: Enhancing sam 2 for long video segmentation with a training-free memory tree (2024),https://arxiv.org/abs/2410.16268

  10. [10]

    He, S., Jie, G., Wang, C., Zhou, Y., Hu, S., Li, G., Ding, H.: Refersplat: Referring segmentation in 3d gaussian splatting (2025),https://arxiv.org/abs/2508.08252

  11. [11]

    Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3d-llm: Injecting the 3d world into large language models (2023), https://arxiv.org/abs/2307.12981

  12. [12]

    Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.C., Jia, B., Huang, S.: An embodied generalist agent in 3d world (2024), https://arxiv.org/abs/2311.12871

  13. [13]

    Jeong, Y., Sun, C., Wang, Y.C.F., Cho, M., Choe, J.: Mv-sam: Multi-view promptable segmentation using pointmap guidance (2026), https://arxiv.org/abs/2601.17866

  14. [14]

    Kang, R., Chen, H., Gkioxari, G., Perona, P.: Linear mechanisms for spatiotemporal reasoning in vision language models (2026),https://arxiv.org/abs/2601.12626

  15. [15]

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P.: Mapanything: Universal feed-forward metric 3d reconstruction (2026),https://arxiv.org/abs/2509.13414

  16. [16]

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering (2023),https://arxiv.org/abs/2308.04079

  17. [17]

    Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language embedded radiance fields (2023),https://arxiv.org/abs/2303.09553

  18. [18]

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023),https://arxiv.org/abs/2304.02643

  19. [19]

    Kuang, L., Velikova, Y., Saleh, M., Zaech, J.N., Paudel, D.P., Busam, B.: Conceptpose: Training-free zero-shot object pose estimation using concept vectors (2025), https://arxiv.org/abs/2512.09056

  20. [20]

    Lee, T., Wen, B., Kang, M., Kang, G., Kweon, I.S., Yoon, K.J.: Any6d: Model-free 6d pose estimation of novel objects (2025),https://arxiv.org/abs/2503.18673

  21. [21]

    Li, Q., Sun, J., An, L., Su, Z., Zhang, H., Liu, Y.: Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields (2025), https://arxiv.org/abs/2506.09565

  22. [22]

    Li, W., Zhao, Y., Qin, M., Liu, Y., Cai, Y., Gan, C., Pfister, H.: Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps (2025), https://arxiv.org/abs/2507.07136

  23. [23]

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views (2025), https://arxiv.org/abs/2511.10647

  24. [24]

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., Li, X., Sun, X., Ashok, R., Mukherjee, A., Kang, H., Kong, X., Hua, G., Zhang, T., Benes, B., Bera, A.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision (2023),https://arxiv.org/abs/2312.16256

  25. [25]

    Liu, X., Tayal, P., Wang, J., Zarzar, J., Monnier, T., Tertikas, K., Duan, J., Toisoul, A., Zhang, J.Y., Neverova, N., Vedaldi, A., Shapovalov, R., Novotny, D.: Uncommon objects in 3d (2025),https://arxiv.org/abs/2501.07574

  26. [26]

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis (2020), https://arxiv.org/abs/2003.08934

  27. [27]

    Mirzaei, A., Aumentado-Armstrong, T., Derpanis, K.G., Kelly, J., Brubaker, M.A., Gilitschenski, I., Levinshtein, A.: Spin-nerf: Multiview segmentation and perceptual inpainting with neural radiance fields (2023),https://arxiv.org/abs/2211.12254

  28. [28]

    Miyato, T., Jaeger, B., Welling, M., Geiger, A.: Gta: A geometry-aware attention mechanism for multi-view transformers (2024), https://arxiv.org/abs/2310.10375

  29. [29]

    Nguyen, P.D.A., Ngo, T.D., Kalogerakis, E., Gan, C., Tran, A., Pham, C., Nguyen, K.: Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance (2024),https://arxiv.org/abs/2312.10671

  30. [30]

    Peng, S., Genova, K., Jiang, C.M., Tagliasacchi, A., Pollefeys, M., Funkhouser, T.: Openscene: 3d scene understanding with open vocabularies (2023), https://arxiv.org/abs/2211.15654

  31. [31]

    Peng, Y., Wang, C., Wang, X., Lu, Y., Wang, J., Fu, Y.: Gags: Granularity-aware feature distillation for language gaussian splatting (2024)

  32. [32]

    Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting (2024),https://arxiv.org/abs/2312.16084

  33. [33]

    Qu, Y., Dai, S., Li, X., Lin, J., Cao, L., Zhang, S., Ji, R.: Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. In: Proceedings of the 32nd ACM International Conference on Multimedia (2024)

  34. [34]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020

  35. [35]

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos (2024),https://arxiv.org/abs/2408.00714

  36. [36]

    Ren, Z., Agarwala, A., Russell, B., Schwing, A.G., Wang, O.: Neural volumetric object selection (2022),https://arxiv.org/abs/2205.14929

  37. [37]

    Segre, L., Hirschorn, O., Avidan, S.: Multi-view foundation models (2025), https://arxiv.org/abs/2512.15708

  38. [38]

    Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3d gaussians for open-vocabulary scene understanding (2023),https://arxiv.org/abs/2311.18482

  39. [39]

    Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation (2023), https://arxiv.org/abs/2306.13631

  40. [40]

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer (2025), https://arxiv.org/abs/2503.11651

  41. [41]

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy (2024),https://arxiv.org/abs/2312.14132

  42. [42]

    Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: π3: Permutation-equivariant visual geometry learning (2025), https://arxiv.org/abs/2507.13347

  43. [43]

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects (2024), https://arxiv.org/abs/2312.08344

  44. [44]

    Yang, J., Chen, X., Qian, S., Madaan, N., Iyengar, M., Fouhey, D.F., Chai, J.: Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent (2023),https://arxiv.org/abs/2309.12311

  45. [45]

    Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: Segment and edit anything in 3d scenes (2024),https://arxiv.org/abs/2312.00732

  46. [46]

    Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: Scannet++: A high-fidelity dataset of 3d indoor scenes (2023),https://arxiv.org/abs/2308.11417

  47. [47]

    Ying, H., Yin, Y., Zhang, J., Wang, F., Yu, T., Huang, R., Fang, L.: Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning (2023), https://arxiv.org/abs/2311.11666

  48. [48]

    Yu, J., Hari, K., Srinivas, K., El-Refai, K., Rashid, A., Kim, C.M., Kerr, J., Cheng, R., Irshad, M.Z., Balakrishna, A., Kollar, T., Goldberg, K.: Language-embedded gaussian splats (legs): Incrementally building room-scale representations with a mobile robot (2024),https://arxiv.org/abs/2409.18108

  49. [49]

    Zuo, X., Samangouei, P., Zhou, Y., Di, Y., Li, M.: Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding (2024), https://arxiv.org/abs/2401.01970
    Appendix A.1 (GASA Decoder Layer): Each of the 6 GASA decoder layers processes queries through four sequential operations:

  50. [50]

    Self-Attention: 3 learnable queries attend to each other via standard multi-head attention (8 heads, dimension 32 per head)

  51. [51]

    Text Cross-Attention: Queries attend to text embeddings from SAM3's text encoder, applied at every layer following SAM3's design

  52. [52]

    GASA Cross-Attention: Queries attend to encoder memory with geometric bias (Eq. 2)

  53. [53]

    Feed-Forward Network: Two-layer MLP with GELU activation (256 → 2048 → 256). Appendix A.2 (Distance Kernel): The geometric kernel $\phi$ in GASA is a small MLP: $\phi(d) = \mathbf{w}_2^{\top} \cdot \mathrm{ReLU}(\mathbf{w}_1 d + \mathbf{b}_1) + b_2$ (5), with $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^{32}$. The input distance $d$ is in meters (from DA3 metric depth), providing a natural scale for indoor scenes. We initialize $b_2 = -1$ to encourage suppression of distant matc… (a short code sketch of this kernel appears after this list)

  54. [54]

    Compute spatial context: For each visible instance of the target class, compute the 2D centroid (cx, cy) (normalized to [0, 1]) and depth d at the centroid from the DA3 depth map

  55. [55]

    Determine true qualifiers: Compare the target instance against all same-class instances. A qualifier is valid only if it is true: e.g., “nearest” requires $d_{\text{target}} \le \min(d_{\text{others}}) + \epsilon$

  56. [56]

    Augment with probability p: With probability p = 0.3, randomly select one valid qualifier and prepend it to the text prompt (e.g., “chair” → “nearest chair”). This ensures the model only trains on correct spatial associations, preventing confusion from false labels. Multi-instance filtering: when spatial_multi_instance_only is enabled, augmentation is skipp…
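The appendix excerpts in items [49]-[53] describe the GASA decoder layer and its distance kernel (Eq. 5). As referenced in item [53], here is a minimal, hypothetical PyTorch sketch of that kernel; the hidden width of 32 and the $b_2 = -1$ initialization come from the excerpt above, while the class name and the way the bias is added to attention logits are assumptions.

```python
# Minimal sketch (assumptions marked) of the appendix's distance kernel, Eq. (5):
# phi(d) = w2^T · ReLU(w1 d + b1) + b2, with w1, w2 in R^32 and d in meters.
import torch
import torch.nn as nn

class DistanceKernel(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden)   # w1, b1
        self.fc2 = nn.Linear(hidden, 1)   # w2, b2
        # Per the appendix: b2 is initialized to -1 to encourage suppression of distant matches.
        nn.init.constant_(self.fc2.bias, -1.0)

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        # d: pairwise distances in meters (e.g. derived from DA3 metric depth);
        # returns an additive attention bias of the same shape.
        return self.fc2(torch.relu(self.fc1(d.unsqueeze(-1)))).squeeze(-1)

# Illustrative use as the geometric bias of item [52] (the wiring is an assumption):
#   logits = q @ k.transpose(-2, -1) / scale + phi(pairwise_distance)
phi = DistanceKernel()
print(phi(torch.tensor([0.1, 1.0, 5.0])))  # larger distances get a more negative bias after training
```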