MV-GEL: Language-Driven Multi-View Geometric Entity Localization on Meshes
Pith reviewed 2026-07-01 05:54 UTC · model grok-4.3
The pith
Selecting language-guided viewpoints allows accurate localization of edges and faces on 3D meshes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MV-GEL shows that reliable localization of fine-grained geometric entities such as faces, edges, and solids on meshes from language queries requires a prompt-conditioned ranking of viewpoints to maximize interpretability, followed by VLM-based mask prediction and geometry-aware lifting to the mesh.
What carries the argument
GELviews, the prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities.
If this is right
- Face-level IoU improves by up to 1.7 times over vanilla baselines.
- Edge-level F1 scores increase by over 4.5 times, especially for thin structures.
- The framework outperforms both CLIP-based and random view sampling methods.
- It operates on any 3D mesh without requiring CAD-specific information.
- Predicted masks from selected views are lifted accurately via ray casting.
Where Pith is reading between the lines
- Integrating this view selection with robotic planning could enable language-based manipulation of specific object parts.
- Applying similar ranking to time-varying scenes might extend the approach to video or animation data.
- The reliance on mesh geometry suggests potential adaptation to other 3D formats like voxels if projection methods are adjusted.
- The performance gains on view-sensitive structures indicate broader applicability in scientific visualization tasks.
Load-bearing premise
Reliable CAD entity localization depends on selecting views that make the queried entity maximally interpretable.
What would settle it
Running the localization task on the evaluation meshes using only randomly selected views and finding no significant drop in IoU or F1 scores compared to the ranked views would falsify the importance of the view selection step.
Figures
read the original abstract
Identifying and grounding precise geometric entities, such as edges, planar regions, and curved surfaces within 3D objects, is foundational to computer-aided design (CAD), robotic manipulation, and scientific simulation. Although modern Vision Language Models (VLMs) have advanced referring segmentation (RIS) in the image domain, extending such language-driven localization to structured 3D geometry is substantially harder. The 3D object appearance is highly sensitive to viewpoints; a single perspective may render a target entity clearly observable, while another may suffer from severe occlusion or foreshortening. In this work, we attempt to solve these challenges with MV-GEL (Multi-View Geometric Entity Localization), a framework for localizing fine-grained geometric entities on polygon meshes from natural language queries. Our key insight is that reliable CAD entity (i.e., faces, edges or solids) localization depends on selecting views that make the queried entity maximally interpretable. We introduce GELviews, a prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities. Selected views are processed by a VLM-based reasoning segmentation backbone, and predicted masks are lifted to the corresponding meshes via geometry-aware ray casting. Our framework is completely CAD agnostic and relies only on 3D meshes. Experiments show up to a 1.7X improvement in face-level IoU and over 4.5X gains in edge-level F1 compared to vanilla baselines, substantially outperforming CLIP-based and random view sampling, particularly for thin and view-sensitive structures.The dataset, code and trained checkpoints are available at https://github.com/kbali1297/MV-GEL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MV-GEL, a framework for language-driven localization of geometric entities (faces, edges, solids) on 3D polygon meshes. The key component is GELviews, a prompt-conditioned module that ranks viewpoints for maximal interpretability of the queried entity. Selected views are segmented using a VLM-based reasoning segmentation model, and the 2D masks are lifted to the mesh using geometry-aware ray casting. The method is claimed to be mesh-only and CAD-agnostic. Experiments report up to 1.7× face-level IoU and 4.5× edge-level F1 improvements over baselines including random and CLIP-based view sampling.
Significance. If the empirical gains hold under rigorous evaluation, the work could provide a practical advance in referring segmentation for structured 3D geometry, with potential applications in CAD, robotics, and simulation. The release of dataset, code, and checkpoints is a positive factor for reproducibility.
major comments (2)
- [Abstract] Abstract: The abstract states quantitative gains (up to 1.7X face-level IoU and over 4.5X edge-level F1) but supplies no experimental protocol, dataset details, baseline definitions, error bars, or statistical tests. This prevents evaluation of the central claim that GELviews outperforms random and CLIP-based sampling, particularly for thin structures.
- [§3] §3 (Method): The description of GELviews as a 'prompt-conditioned ranking module' that prioritizes viewpoints based on language-prompted observability lacks sufficient detail on its architecture, input features, training procedure, or scoring function. Without this, it is unclear how the module differs from standard VLM prompting or whether it introduces hidden parameters that contradict the 'CAD agnostic' claim.
minor comments (2)
- [Abstract] Abstract: The acronym 'GELviews' is introduced without an immediate parenthetical expansion or definition, reducing readability on first encounter.
- [Abstract] The paper states that 'the dataset, code and trained checkpoints are available' but provides no link or citation in the abstract; ensure the GitHub URL appears in the main text and is correctly formatted.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract states quantitative gains (up to 1.7X face-level IoU and over 4.5X edge-level F1) but supplies no experimental protocol, dataset details, baseline definitions, error bars, or statistical tests. This prevents evaluation of the central claim that GELviews outperforms random and CLIP-based sampling, particularly for thin structures.
Authors: The abstract is written to be concise and highlight key results within typical length constraints. Full experimental details—including the mesh dataset derived from CAD models, baseline definitions (random sampling and CLIP-based view selection), evaluation protocol with face IoU and edge F1, error bars from repeated runs, and statistical comparisons—are provided in Section 4 and the supplementary material. We will revise the abstract to include a short clause referencing the evaluation metrics and dataset to improve standalone readability. revision: yes
-
Referee: [§3] §3 (Method): The description of GELviews as a 'prompt-conditioned ranking module' that prioritizes viewpoints based on language-prompted observability lacks sufficient detail on its architecture, input features, training procedure, or scoring function. Without this, it is unclear how the module differs from standard VLM prompting or whether it introduces hidden parameters that contradict the 'CAD agnostic' claim.
Authors: GELviews operates as a zero-shot ranking procedure that feeds rendered views and the language prompt into an off-the-shelf VLM to obtain observability scores; no additional architecture, trainable parameters, or fine-tuning is introduced. The scoring function directly uses the VLM's output logits or reasoning trace for entity visibility, preserving the CAD-agnostic property since only mesh rendering is required. We will expand the description in §3 with explicit pseudocode for the ranking step and a statement confirming the absence of hidden parameters. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces MV-GEL as a framework combining a prompt-conditioned GELviews ranking module, VLM-based segmentation, and geometry-aware ray casting for lifting masks to meshes. No mathematical derivations, equations, or first-principles predictions are claimed that reduce to self-definition, fitted parameters renamed as outputs, or load-bearing self-citations. The approach relies on external VLM capabilities and standard mesh operations, with performance gains demonstrated empirically against independent baselines (random and CLIP sampling). The method is described as mesh-only and CAD-agnostic, making the contribution self-contained without any step that is equivalent to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Vision-language models can perform referring image segmentation on rendered views of 3D meshes
- domain assumption Geometry-aware ray casting can accurately transfer 2D segmentation masks onto 3D mesh surfaces
invented entities (1)
-
GELviews
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Abdelreheem, A., Skorokhodov, I., Ovsjanikov, M., Wonka, P.: Satr: Zero-shot se- mantic segmentation of 3d shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15166–15179 (2023)
2023
-
[2]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Chen, Q., Wu, C., Ji, J., Ma, Y., Yang, D., Sun, X.: Ipdn: Image-enhanced prompt decoding network for 3d referring expression segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2132–2140 (2025)
2025
-
[4]
In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition
Chen, X., Li, Q., Wang, T., Xue, T., Pang, J.: Gennbv: Generalizable next-best- view policy for active 3d reconstruction. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. pp. 16436–16445 (2024)
2024
-
[5]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
2024
-
[6]
In: Proceedings
Connolly, C.: The determination of next best views. In: Proceedings. 1985 IEEE international conference on robotics and automation. vol. 2, pp. 432–435. IEEE (1985)
1985
-
[7]
Advances in neural information processing systems36, 49250–49267 (2023)
Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems36, 49250–49267 (2023)
2023
-
[8]
In: Proceedings of the IEEE/CVF international conference on computer vision
Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query gen- eration for referring segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 16321–16330 (2021)
2021
-
[9]
In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Garosi, M., Tedoldi, R., Boscaini, D., Mancini, M., Sebe, N., Poiesi, F.: 3d part segmentation via geometric aggregation of 2d visual features. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 3257–3267. IEEE (2025)
2025
-
[10]
In: International conference on machine learning
Goyal, A., Law, H., Liu, B., Newell, A., Deng, J.: Revisiting point cloud shape classification with a simple and effective baseline. In: International conference on machine learning. pp. 3809–3820. PMLR (2021)
2021
-
[11]
arXiv preprint arXiv:2309.00615 (2023)
Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)
-
[12]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Guo, Z., Tang, Y., Zhang, R., Wang, D., Wang, Z., Zhao, B., Li, X.: Viewrefer: Grasp the multi-view knowledge for 3d visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15372–15383 (2023)
2023
-
[13]
arXiv preprint arXiv:2601.02457 (2026)
Hadgi, S., Gong, B., Sundararaman, R., Pierson, E., Li, L., Wonka, P., Ovsjanikov, M.: Patchalign3d: Local feature alignment for dense 3d shape understanding. arXiv preprint arXiv:2601.02457 (2026)
-
[14]
arXiv preprint arXiv:2508.08252 (2025)
He, S., Jie, G., Wang, C., Zhou, Y., Hu, S., Li, G., Ding, H.: Refersplat: Referring segmentation in 3d gaussian splatting. arXiv preprint arXiv:2508.08252 (2025)
-
[15]
Jatavallabhula, K.M., Kuwajerwala, A., Gu, Q., Omama, M., Chen, T., Maalouf, A., Li, S., Iyer, G., Saryazdi, S., Keetha, N., et al.: Conceptfusion: Open-set mul- timodal 3d mapping. arXiv preprint arXiv:2302.07241 (2023)
-
[16]
arXiv preprint arXiv:2203.13944 (2022) MV-GEL: Multi-View Geometric Entity Localization on Meshes 17
Jayaraman, P.K., Lambourne, J.G., Desai, N., Willis, K.D., Sanghi, A., Morris, N.J.: Solidgen: An autoregressive model for direct b-rep synthesis. arXiv preprint arXiv:2203.13944 (2022) MV-GEL: Multi-View Geometric Entity Localization on Meshes 17
-
[17]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Jayaraman, P.K., Sanghi, A., Lambourne, J.G., Willis, K.D., Davies, T., Shayani, H., Morris, N.: Uv-net: Learning from boundary representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11703–11712 (2021)
2021
-
[18]
In: International conference on machine learning
Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
2021
-
[19]
ACM Transactions on Graphics (TOG)40(6), 1–18 (2021)
Jones, B., Hildreth, D., Chen, D., Baran, I., Kim, V.G., Schulz, A.: Automate: A dataset and learning approach for automatic mating of cad assemblies. ACM Transactions on Graphics (TOG)40(6), 1–18 (2021)
2021
-
[20]
In: European Conference on Computer Vision
Kareem, A., Lahoud, J., Cholakkal, H.: Paris3d: Reasoning-based 3d part seg- mentation using large multimodal model. In: European Conference on Computer Vision. pp. 466–482. Springer (2024)
2024
-
[21]
In: Proceedings of the IEEE/CVF international conference on computer vision
Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language em- bedded radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 19729–19739 (2023)
2023
-
[22]
Ad- vances in Neural Information Processing Systems37, 7552–7579 (2024)
Khan, M.S., Sinha, S., Sheikh, T.U., Stricker, D., Ali, S.A., Afzal, M.Z.: Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts. Ad- vances in Neural Information Processing Systems37, 7552–7579 (2024)
2024
-
[23]
In: Proceedings of the IEEE/CVF international conference on computer vision
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)
2023
-
[24]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Koch, S., Matveev, A., Jiang, Z., Williams, F., Artemov, A., Burnaev, E., Alexa, M., Zorin, D., Panozzo, D.: Abc: A big cad model dataset for geometric deep learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9601–9611 (2019)
2019
-
[25]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning seg- mentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024)
2024
-
[26]
Lambourne, J.G., Willis, K.D., Jayaraman, P.K., Sanghi, A., Meltzer, P., Shayani, H.:Brepnet:Atopologicalmessagepassingsystemforsolidmodels.In:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12773–12782 (2021)
2021
-
[27]
In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition
Li, J., Ma, W., Li, X., Lou, Y., Zhou, G., Zhou, X.: Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 18563–18573 (2025)
2025
-
[28]
In: International conference on machine learning
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)
2023
-
[29]
In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition
Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)
2022
-
[30]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, C., Ding, H., Jiang, X.: Gres: Generalized referring expression segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23592–23601 (2023)
2023
-
[31]
Advances in neural information processing systems36, 34892–34916 (2023) 18 K
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 18 K. Bali et al
2023
-
[32]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition
Liu, M., Zhu, Y., Cai, H., Han, S., Ling, Z., Porikli, F., Su, H.: Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 21736–21746 (2023)
2023
-
[33]
arXiv e-prints pp
Liu, Y., Shekhar Dutt, N., Li, C., Mitra, N.J.: B-repler: Semantic b-rep latent editor using large language models. arXiv e-prints pp. arXiv–2508 (2025)
2025
-
[34]
Advances in neural information processing systems32(2019)
Lu,J.,Batra,D.,Parikh,D.,Lee,S.:Vilbert:Pretrainingtask-agnosticvisiolinguis- tic representations for vision-and-language tasks. Advances in neural information processing systems32(2019)
2019
-
[35]
Advances in Neural Information Processing Systems36, 75307– 75337 (2023)
Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pre- trained models. Advances in Neural Information Processing Systems36, 75307– 75337 (2023)
2023
-
[36]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3d. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7818–7827 (2025)
2025
-
[37]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Mei, G., Riz, L., Wang, Y., Poiesi, F.: Geometrically-driven aggregation for zero- shot 3d point cloud understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27896–27905 (2024)
2024
-
[38]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815– 824 (2023)
2023
-
[39]
In: Proceedings of the AAAI conference on artificial intelligence
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual rea- soning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
2018
-
[40]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)
2021
-
[41]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: Pixellm: Pixel rea- soning with large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26374–26383 (2024)
2024
-
[42]
In: Proceedings of the IEEE international conference on computer vision
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE international conference on computer vision. pp. 945–953 (2015)
2015
-
[43]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vin- cent,D.,Pan,Z.,Wang,S.,etal.:Gemini1.5:Unlockingmultimodalunderstanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Advances in neural information pro- cessing systems30(2017)
Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30(2017)
2017
-
[45]
Dominant Set Clustering and Pooling for Multi-View 3D Object Recognition
Wang, C., Pelillo, M., Siddiqi, K.: Dominant set clustering and pooling for multi- view 3d object recognition. arXiv preprint arXiv:1906.01592 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[46]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: Cris: Clip-driven referring image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11686–11695 (2022)
2022
-
[47]
ACM Transactions on Graphics (TOG)40(4), 1–24 (2021) MV-GEL: Multi-View Geometric Entity Localization on Meshes 19
Willis, K.D., Pu, Y., Luo, J., Chu, H., Du, T., Lambourne, J.G., Solar-Lezama, A., Matusik, W.: Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences. ACM Transactions on Graphics (TOG)40(4), 1–24 (2021) MV-GEL: Multi-View Geometric Entity Localization on Meshes 19
2021
-
[48]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: Gsva: Generalized segmen- tation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3858–3869 (2024)
2024
-
[49]
Xu, J., Wang, C., Zhao, Z., Liu, W., Ma, Y., Gao, S.: Cad-mllm: Uni- fying multimodality-conditioned cad generation with mllm. arXiv preprint arXiv:2411.04954 (2024)
-
[50]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: Mattnet: Mod- ular attention network for referring expression comprehension. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1307–1315 (2018)
2018
-
[51]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Yuan, Y., Li, W., Liu, J., Tang, D., Luo, X., Qin, C., Zhang, L., Zhu, J.: Os- prey: Pixel understanding with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28202– 28211 (2024)
2024
-
[52]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhang, R., Guo, Z., Zhang, W., Li, K., Miao, X., Cui, B., Qiao, Y., Gao, P., Li, H.: Pointclip: Point cloud understanding by clip. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8552–8562 (2022)
2022
-
[53]
arXiv preprint arXiv:2306.12156 (2023)
Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., Wang, J.: Fast segment anything. arXiv preprint arXiv:2306.12156 (2023)
-
[54]
arXiv preprint arXiv:2312.03015 (2023)
Zhou, Y., Gu, J., Li, X., Liu, M., Fang, Y., Su, H.: Partslip++: Enhancing low- shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. arXiv preprint arXiv:2312.03015 (2023)
-
[55]
In: Proceedings of the IEEE/CVF international conference on computer vision
Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z., Zhang, S., Gao, P.: Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2639–2650 (2023)
2023
-
[56]
Advances in neural information processing systems36, 19769–19782 (2023)
Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. Advances in neural information processing systems36, 19769–19782 (2023)
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.