pith. sign in

arxiv: 2606.31533 · v1 · pith:KHXS3UIZnew · submitted 2026-06-30 · 💻 cs.CV

MV-GEL: Language-Driven Multi-View Geometric Entity Localization on Meshes

Pith reviewed 2026-07-01 05:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-view geometric localizationlanguage-driven segmentation3D mesh entity groundingvision language modelsviewpoint selectionCAD model processingreferring segmentation
0
0 comments X

The pith

Selecting language-guided viewpoints allows accurate localization of edges and faces on 3D meshes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for grounding natural language descriptions to specific geometric entities on 3D polygon meshes. It establishes that choosing viewpoints where the target entity is most visible and least occluded is essential for reliable performance with vision-language models. The method ranks candidate views using a dedicated module, applies segmentation to the selected images, and projects the results back to the mesh using ray casting. This addresses the viewpoint sensitivity that makes direct 3D localization from language difficult in CAD and robotics applications.

Core claim

MV-GEL shows that reliable localization of fine-grained geometric entities such as faces, edges, and solids on meshes from language queries requires a prompt-conditioned ranking of viewpoints to maximize interpretability, followed by VLM-based mask prediction and geometry-aware lifting to the mesh.

What carries the argument

GELviews, the prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities.

If this is right

  • Face-level IoU improves by up to 1.7 times over vanilla baselines.
  • Edge-level F1 scores increase by over 4.5 times, especially for thin structures.
  • The framework outperforms both CLIP-based and random view sampling methods.
  • It operates on any 3D mesh without requiring CAD-specific information.
  • Predicted masks from selected views are lifted accurately via ray casting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating this view selection with robotic planning could enable language-based manipulation of specific object parts.
  • Applying similar ranking to time-varying scenes might extend the approach to video or animation data.
  • The reliance on mesh geometry suggests potential adaptation to other 3D formats like voxels if projection methods are adjusted.
  • The performance gains on view-sensitive structures indicate broader applicability in scientific visualization tasks.

Load-bearing premise

Reliable CAD entity localization depends on selecting views that make the queried entity maximally interpretable.

What would settle it

Running the localization task on the evaluation meshes using only randomly selected views and finding no significant drop in IoU or F1 scores compared to the ranked views would falsify the importance of the view selection step.

Figures

Figures reproduced from arXiv: 2606.31533 by Kartik Bali, Roland Aydin.

Figure 1
Figure 1. Figure 1: We can tie our multi-view strategy to a more general digestibility paradigm, where certain representations of data are more interpretable to a language model than others parts. In robotics, grasp planning depends on functional regions such as han￾dles; in CAD and manufacturing, simulation and editing require selecting exact geometric entities including fillets, chamfers, and load-bearing surfaces. These ta… view at source ↗
Figure 2
Figure 2. Figure 2: GELviews: Our View selection framework takes the natural language prompt containing the geometric and view-point descriptions and returns semantically optimal views. In this example, the views with planar face at the bottom with cut-outs get higher ranks than the top views. Language Query Construction For each sampled entity’s maximal view, we construct a natural language query that uniquely identifies its… view at source ↗
Figure 3
Figure 3. Figure 3: MV-GEL Pipeline: In our framework, we use GELviews to compute the top-k views most relevant to localize a feature mentioned in the prompt. These views are then given sequentially to LISA-CAD for semantic segmentation to obtain view binary masks. Finally these binary masks are projected to the final mesh to obtain mesh features (edges/faces). Pairwise Ranking Objective GELviews is trained to prioritize view… view at source ↗
Figure 4
Figure 4. Figure 4: Examples of language-driven face and edge localization. The green mask denotes entity segmentation on the meshes via our MV-GEL framework. conducted on 4 NVIDIA H100 GPUs, updating only the LoRA adapter layers and the mask decoder while keeping the remaining backbone parameters frozen. We use a learning rate of 1 × 10−4 for all our trainable modules. Evaluation is performed on a held-out test set of 1535 q… view at source ↗
Figure 5
Figure 5. Figure 5: Impact of view quantity on recall performance for Faces (a) and Edges (b) using CAD adapted LISA. weight definition. The use of area and length weighting ensures that evaluation reflects true geometric coverage and structural correctness [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Inference efficiency and qualitative localization results. Left: [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: MV-GEL localization pipeline. (Left) Default mesh view, (Center) GELviews selected view (top-1) segmented by LISA-CAD, (Right) MV-GEL localized entity pre￾dictions on the mesh. Ultimately, our results demonstrate that language-driven 3D localization is fundamentally a geometry-aware view prioritization problem, not merely an image-text alignment task. By bridging multimodal reasoning with object un￾derstan… view at source ↗
read the original abstract

Identifying and grounding precise geometric entities, such as edges, planar regions, and curved surfaces within 3D objects, is foundational to computer-aided design (CAD), robotic manipulation, and scientific simulation. Although modern Vision Language Models (VLMs) have advanced referring segmentation (RIS) in the image domain, extending such language-driven localization to structured 3D geometry is substantially harder. The 3D object appearance is highly sensitive to viewpoints; a single perspective may render a target entity clearly observable, while another may suffer from severe occlusion or foreshortening. In this work, we attempt to solve these challenges with MV-GEL (Multi-View Geometric Entity Localization), a framework for localizing fine-grained geometric entities on polygon meshes from natural language queries. Our key insight is that reliable CAD entity (i.e., faces, edges or solids) localization depends on selecting views that make the queried entity maximally interpretable. We introduce GELviews, a prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities. Selected views are processed by a VLM-based reasoning segmentation backbone, and predicted masks are lifted to the corresponding meshes via geometry-aware ray casting. Our framework is completely CAD agnostic and relies only on 3D meshes. Experiments show up to a 1.7X improvement in face-level IoU and over 4.5X gains in edge-level F1 compared to vanilla baselines, substantially outperforming CLIP-based and random view sampling, particularly for thin and view-sensitive structures.The dataset, code and trained checkpoints are available at https://github.com/kbali1297/MV-GEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MV-GEL, a framework for language-driven localization of geometric entities (faces, edges, solids) on 3D polygon meshes. The key component is GELviews, a prompt-conditioned module that ranks viewpoints for maximal interpretability of the queried entity. Selected views are segmented using a VLM-based reasoning segmentation model, and the 2D masks are lifted to the mesh using geometry-aware ray casting. The method is claimed to be mesh-only and CAD-agnostic. Experiments report up to 1.7× face-level IoU and 4.5× edge-level F1 improvements over baselines including random and CLIP-based view sampling.

Significance. If the empirical gains hold under rigorous evaluation, the work could provide a practical advance in referring segmentation for structured 3D geometry, with potential applications in CAD, robotics, and simulation. The release of dataset, code, and checkpoints is a positive factor for reproducibility.

major comments (2)
  1. [Abstract] Abstract: The abstract states quantitative gains (up to 1.7X face-level IoU and over 4.5X edge-level F1) but supplies no experimental protocol, dataset details, baseline definitions, error bars, or statistical tests. This prevents evaluation of the central claim that GELviews outperforms random and CLIP-based sampling, particularly for thin structures.
  2. [§3] §3 (Method): The description of GELviews as a 'prompt-conditioned ranking module' that prioritizes viewpoints based on language-prompted observability lacks sufficient detail on its architecture, input features, training procedure, or scoring function. Without this, it is unclear how the module differs from standard VLM prompting or whether it introduces hidden parameters that contradict the 'CAD agnostic' claim.
minor comments (2)
  1. [Abstract] Abstract: The acronym 'GELviews' is introduced without an immediate parenthetical expansion or definition, reducing readability on first encounter.
  2. [Abstract] The paper states that 'the dataset, code and trained checkpoints are available' but provides no link or citation in the abstract; ensure the GitHub URL appears in the main text and is correctly formatted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract states quantitative gains (up to 1.7X face-level IoU and over 4.5X edge-level F1) but supplies no experimental protocol, dataset details, baseline definitions, error bars, or statistical tests. This prevents evaluation of the central claim that GELviews outperforms random and CLIP-based sampling, particularly for thin structures.

    Authors: The abstract is written to be concise and highlight key results within typical length constraints. Full experimental details—including the mesh dataset derived from CAD models, baseline definitions (random sampling and CLIP-based view selection), evaluation protocol with face IoU and edge F1, error bars from repeated runs, and statistical comparisons—are provided in Section 4 and the supplementary material. We will revise the abstract to include a short clause referencing the evaluation metrics and dataset to improve standalone readability. revision: yes

  2. Referee: [§3] §3 (Method): The description of GELviews as a 'prompt-conditioned ranking module' that prioritizes viewpoints based on language-prompted observability lacks sufficient detail on its architecture, input features, training procedure, or scoring function. Without this, it is unclear how the module differs from standard VLM prompting or whether it introduces hidden parameters that contradict the 'CAD agnostic' claim.

    Authors: GELviews operates as a zero-shot ranking procedure that feeds rendered views and the language prompt into an off-the-shelf VLM to obtain observability scores; no additional architecture, trainable parameters, or fine-tuning is introduced. The scoring function directly uses the VLM's output logits or reasoning trace for entity visibility, preserving the CAD-agnostic property since only mesh rendering is required. We will expand the description in §3 with explicit pseudocode for the ranking step and a statement confirming the absence of hidden parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces MV-GEL as a framework combining a prompt-conditioned GELviews ranking module, VLM-based segmentation, and geometry-aware ray casting for lifting masks to meshes. No mathematical derivations, equations, or first-principles predictions are claimed that reduce to self-definition, fitted parameters renamed as outputs, or load-bearing self-citations. The approach relies on external VLM capabilities and standard mesh operations, with performance gains demonstrated empirically against independent baselines (random and CLIP sampling). The method is described as mesh-only and CAD-agnostic, making the contribution self-contained without any step that is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Abstract-only review prevents exhaustive enumeration; the listed items are the minimal assumptions required by the described pipeline.

axioms (2)
  • domain assumption Vision-language models can perform referring image segmentation on rendered views of 3D meshes
    The framework uses a VLM-based reasoning segmentation backbone as a core component.
  • domain assumption Geometry-aware ray casting can accurately transfer 2D segmentation masks onto 3D mesh surfaces
    The lifting step relies on this geometric operation.
invented entities (1)
  • GELviews no independent evidence
    purpose: Prompt-conditioned ranking module that prioritizes viewpoints based on language-prompted observability of geometric entities
    New module introduced to select informative views; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5828 in / 1429 out tokens · 27902 ms · 2026-07-01T05:54:34.648785+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Abdelreheem, A., Skorokhodov, I., Ovsjanikov, M., Wonka, P.: Satr: Zero-shot se- mantic segmentation of 3d shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15166–15179 (2023)

  2. [2]

    Qwen Technical Report

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  3. [3]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Chen, Q., Wu, C., Ji, J., Ma, Y., Yang, D., Sun, X.: Ipdn: Image-enhanced prompt decoding network for 3d referring expression segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2132–2140 (2025)

  4. [4]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition

    Chen, X., Li, Q., Wang, T., Xue, T., Pang, J.: Gennbv: Generalizable next-best- view policy for active 3d reconstruction. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. pp. 16436–16445 (2024)

  5. [5]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  6. [6]

    In: Proceedings

    Connolly, C.: The determination of next best views. In: Proceedings. 1985 IEEE international conference on robotics and automation. vol. 2, pp. 432–435. IEEE (1985)

  7. [7]

    Advances in neural information processing systems36, 49250–49267 (2023)

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)

  8. [8]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query gen- eration for referring segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 16321–16330 (2021)

  9. [9]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Garosi, M., Tedoldi, R., Boscaini, D., Mancini, M., Sebe, N., Poiesi, F.: 3d part segmentation via geometric aggregation of 2d visual features. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 3257–3267. IEEE (2025)

  10. [10]

    In: International conference on machine learning

    Goyal, A., Law, H., Liu, B., Newell, A., Deng, J.: Revisiting point cloud shape classification with a simple and effective baseline. In: International conference on machine learning. pp. 3809–3820. PMLR (2021)

  11. [11]

    arXiv preprint arXiv:2309.00615 (2023)

    Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)

  12. [12]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Guo, Z., Tang, Y., Zhang, R., Wang, D., Wang, Z., Zhao, B., Li, X.: Viewrefer: Grasp the multi-view knowledge for 3d visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15372–15383 (2023)

  13. [13]

    arXiv preprint arXiv:2601.02457 (2026)

    Hadgi, S., Gong, B., Sundararaman, R., Pierson, E., Li, L., Wonka, P., Ovsjanikov, M.: Patchalign3d: Local feature alignment for dense 3d shape understanding. arXiv preprint arXiv:2601.02457 (2026)

  14. [14]

    arXiv preprint arXiv:2508.08252 (2025)

    He, S., Jie, G., Wang, C., Zhou, Y., Hu, S., Li, G., Ding, H.: Refersplat: Referring segmentation in 3d gaussian splatting. arXiv preprint arXiv:2508.08252 (2025)

  15. [15]

    Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Varma Keetha, Ayush Kumar Tewari, Joshua B

    Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., et al.: Conceptfusion: Open-set mul- timodal 3d mapping. arXiv preprint arXiv:2302.07241 (2023)

  16. [16]

    arXiv preprint arXiv:2203.13944 (2022) MV-GEL: Multi-View Geometric Entity Localization on Meshes 17

    Jayaraman, P.K., Lambourne, J.G., Desai, N., Willis, K.D., Sanghi, A., Morris, N.J.: Solidgen: An autoregressive model for direct b-rep synthesis. arXiv preprint arXiv:2203.13944 (2022) MV-GEL: Multi-View Geometric Entity Localization on Meshes 17

  17. [17]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Jayaraman, P.K., Sanghi, A., Lambourne, J.G., Willis, K.D., Davies, T., Shayani, H., Morris, N.: Uv-net: Learning from boundary representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11703–11712 (2021)

  18. [18]

    In: International conference on machine learning

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)

  19. [19]

    ACM Transactions on Graphics (TOG)40(6), 1–18 (2021)

    Jones, B., Hildreth, D., Chen, D., Baran, I., Kim, V.G., Schulz, A.: Automate: A dataset and learning approach for automatic mating of cad assemblies. ACM Transactions on Graphics (TOG)40(6), 1–18 (2021)

  20. [20]

    In: European Conference on Computer Vision

    Kareem, A., Lahoud, J., Cholakkal, H.: Paris3d: Reasoning-based 3d part seg- mentation using large multimodal model. In: European Conference on Computer Vision. pp. 466–482. Springer (2024)

  21. [21]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language em- bedded radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 19729–19739 (2023)

  22. [22]

    Ad- vances in Neural Information Processing Systems37, 7552–7579 (2024)

    Khan, M.S., Sinha, S., Sheikh, T.U., Stricker, D., Ali, S.A., Afzal, M.Z.: Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts. Ad- vances in Neural Information Processing Systems37, 7552–7579 (2024)

  23. [23]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

  24. [24]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Koch, S., Matveev, A., Jiang, Z., Williams, F., Artemov, A., Burnaev, E., Alexa, M., Zorin, D., Panozzo, D.: Abc: A big cad model dataset for geometric deep learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9601–9611 (2019)

  25. [25]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning seg- mentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024)

  26. [26]

    Lambourne, J.G., Willis, K.D., Jayaraman, P.K., Sanghi, A., Meltzer, P., Shayani, H.:Brepnet:Atopologicalmessagepassingsystemforsolidmodels.In:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12773–12782 (2021)

  27. [27]

    In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

    Li, J., Ma, W., Li, X., Lou, Y., Zhou, G., Zhou, X.: Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 18563–18573 (2025)

  28. [28]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  29. [29]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)

  30. [30]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, C., Ding, H., Jiang, X.: Gres: Generalized referring expression segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23592–23601 (2023)

  31. [31]

    Advances in neural information processing systems36, 34892–34916 (2023) 18 K

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 18 K. Bali et al

  32. [32]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Liu, M., Zhu, Y., Cai, H., Han, S., Ling, Z., Porikli, F., Su, H.: Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 21736–21746 (2023)

  33. [33]

    arXiv e-prints pp

    Liu, Y., Shekhar Dutt, N., Li, C., Mitra, N.J.: B-repler: Semantic b-rep latent editor using large language models. arXiv e-prints pp. arXiv–2508 (2025)

  34. [34]

    Advances in neural information processing systems32(2019)

    Lu,J.,Batra,D.,Parikh,D.,Lee,S.:Vilbert:Pretrainingtask-agnosticvisiolinguis- tic representations for vision-and-language tasks. Advances in neural information processing systems32(2019)

  35. [35]

    Advances in Neural Information Processing Systems36, 75307– 75337 (2023)

    Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pre- trained models. Advances in Neural Information Processing Systems36, 75307– 75337 (2023)

  36. [36]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3d. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7818–7827 (2025)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Mei, G., Riz, L., Wang, Y., Poiesi, F.: Geometrically-driven aggregation for zero- shot 3d point cloud understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27896–27905 (2024)

  38. [38]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815– 824 (2023)

  39. [39]

    In: Proceedings of the AAAI conference on artificial intelligence

    Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual rea- soning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

  40. [40]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: Pixellm: Pixel rea- soning with large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26374–26383 (2024)

  42. [42]

    In: Proceedings of the IEEE international conference on computer vision

    Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE international conference on computer vision. pp. 945–953 (2015)

  43. [43]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vin- cent,D.,Pan,Z.,Wang,S.,etal.:Gemini1.5:Unlockingmultimodalunderstanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  44. [44]

    Advances in neural information pro- cessing systems30(2017)

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30(2017)

  45. [45]

    Dominant Set Clustering and Pooling for Multi-View 3D Object Recognition

    Wang, C., Pelillo, M., Siddiqi, K.: Dominant set clustering and pooling for multi- view 3d object recognition. arXiv preprint arXiv:1906.01592 (2019)

  46. [46]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: Cris: Clip-driven referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11686–11695 (2022)

  47. [47]

    ACM Transactions on Graphics (TOG)40(4), 1–24 (2021) MV-GEL: Multi-View Geometric Entity Localization on Meshes 19

    Willis, K.D., Pu, Y., Luo, J., Chu, H., Du, T., Lambourne, J.G., Solar-Lezama, A., Matusik, W.: Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences. ACM Transactions on Graphics (TOG)40(4), 1–24 (2021) MV-GEL: Multi-View Geometric Entity Localization on Meshes 19

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: Gsva: Generalized segmen- tation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3858–3869 (2024)

  49. [49]

    Cad-mllm: Unify- ing multimodality-conditioned cad generation with mllm.arXiv preprint arXiv:2411.04954,

    Xu, J., Wang, C., Zhao, Z., Liu, W., Ma, Y., Gao, S.: Cad-mllm: Uni- fying multimodality-conditioned cad generation with mllm. arXiv preprint arXiv:2411.04954 (2024)

  50. [50]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: Mattnet: Mod- ular attention network for referring expression comprehension. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1307–1315 (2018)

  51. [51]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yuan, Y., Li, W., Liu, J., Tang, D., Luo, X., Qin, C., Zhang, L., Zhu, J.: Os- prey: Pixel understanding with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28202– 28211 (2024)

  52. [52]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B., Qiao, Y., Gao, P., Li, H.: Pointclip: Point cloud understanding by clip. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8552–8562 (2022)

  53. [53]

    arXiv preprint arXiv:2306.12156 (2023)

    Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., Wang, J.: Fast segment anything. arXiv preprint arXiv:2306.12156 (2023)

  54. [54]

    arXiv preprint arXiv:2312.03015 (2023)

    Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low- shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. arXiv preprint arXiv:2312.03015 (2023)

  55. [55]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z., Zhang, S., Gao, P.: Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2639–2650 (2023)

  56. [56]

    Advances in neural information processing systems36, 19769–19782 (2023)

    Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. Advances in neural information processing systems36, 19769–19782 (2023)