
arxiv: 2605.18193 · v1 · pith:LYF2KEZ3new · submitted 2026-05-18 · 💻 cs.CV · cs.GR

Best Segmentation Buddies for Image-Shape Correspondence

Pith reviewed 2026-05-20 11:07 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords image-shape correspondence · 3D segmentation · feature distillation · semantic matching · cross-modality · computer vision · untextured shapes

The pith

Distilling 2D vision features onto 3D shapes lets Best Segmentation Buddies match image segments to corresponding 3D parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a way to connect pixels inside a 2D image segment to vertices on an untextured 3D shape that belong to the same semantic part. This connection must hold even when the image and the shape differ sharply in color, form, and viewing angle. The method first copies rich visual features learned by a 2D model onto every point on the 3D surface so that pixel-to-vertex similarity can be measured directly. It then selects the vertices whose closest matching pixel sits inside the image segment; these selected vertices are called Best Segmentation Buddies and serve as reliable anchors for semantic correspondence. The same transferred features are finally used to label the 3D shape into parts without any additional training.

Core claim

The central claim is that distilling deep visual features from a 2D vision model onto the 3D shape surface allows computation of feature similarity between image pixels and shape vertices. Identifying Best Segmentation Buddies—vertices whose most similar image pixel lies within the image segmentation region—enables reliable discovery of vertices in semantically corresponding shape parts across substantial differences in appearance, geometry, and viewpoint. The distilled features are also used to segment the shape directly in 3D, bootstrapping the correspondence process.

What carries the argument

Best Segmentation Buddies: 3D shape vertices whose nearest feature match in the 2D image falls inside the given image segment, used to locate semantically corresponding parts.
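The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the array layout, and the use of cosine similarity as the feature metric are all assumptions made here.

```python
import numpy as np

def best_segmentation_buddies(pixel_feats, vertex_feats, segment_mask):
    """Return indices of vertices whose most similar pixel lies inside the segment.

    pixel_feats:  (P, D) features of the P object pixels (hypothetical layout).
    vertex_feats: (V, D) distilled features of the V mesh vertices.
    segment_mask: (P,) boolean, True for pixels inside the 2D segment.
    """
    # L2-normalize so the dot product equals cosine similarity (assumed metric).
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    v = vertex_feats / np.linalg.norm(vertex_feats, axis=1, keepdims=True)
    sim = v @ p.T                       # (V, P) vertex-to-pixel similarity
    nearest_pixel = sim.argmax(axis=1)  # most similar pixel for each vertex
    # Keep the vertices whose best-matching pixel falls inside the segment.
    return np.flatnonzero(segment_mask[nearest_pixel])

# Toy data: three pixels (the first two inside the segment) and three vertices.
pixel_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
segment_mask = np.array([True, True, False])
vertex_feats = np.array([[1.0, 0.05], [0.1, 1.0], [0.7, 0.2]])

buddies = best_segmentation_buddies(pixel_feats, vertex_feats, segment_mask)
print(buddies)  # vertices 0 and 2 match pixels inside the segment
```

Note the direction of the lookup: the nearest neighbor is searched from each vertex over the pixels, and the segment is only used afterwards as a filter.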

If this is right

  • The approach produces accurate and semantically meaningful correspondences for a wide range of image-shape pairs.
  • Distilled 3D features from a 2D image segmentation model can be used to segment the untextured 3D shape directly.
  • Correspondence remains reliable even when appearance, geometry, and viewpoint vary substantially.
  • The bootstrapping step reduces reliance on manual 3D annotations by transferring 2D segmentation knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation step could be applied frame-by-frame to video, yielding time-consistent 3D part labels.
  • The method supplies semantic anchors that might improve registration of 3D scans to casual photographs.
  • Testing the buddies on shapes that contain fine surface details or holes would show where feature transfer begins to break.
  • Because the 3D segmentation step needs no extra labels, the pipeline could help create large-scale labeled 3D datasets from existing 2D image collections.

Load-bearing premise

The assumption that feature similarity after distillation will place the nearest image pixel inside the correct semantic segment rather than being dominated by viewpoint or geometric differences.

What would settle it

On a collection of image-3D pairs that have hand-labeled ground-truth corresponding segments, count how often the identified Best Segmentation Buddies land outside the correct semantic region; if the error rate is no better than that of random vertex selection, the central claim is false.
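A minimal sketch of that test for a single image-shape pair, assuming the ground-truth 3D part is available as a boolean per-vertex mask. The function names and the toy numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def outside_rate(selected, gt_part_mask):
    """Fraction of selected vertices that fall outside the ground-truth part."""
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return float("nan")  # nothing selected, so the rate is undefined
    return float(np.mean(~gt_part_mask[selected]))

def random_outside_rate(gt_part_mask):
    """Expected outside rate when vertices are drawn uniformly at random."""
    return float(np.mean(~gt_part_mask))

# Toy shape with 8 vertices; the first 3 belong to the correct part.
gt_part_mask = np.array([True, True, True, False, False, False, False, False])
selected = np.array([0, 1, 2, 3])  # hypothetical buddies; one is wrong

bsb_err = outside_rate(selected, gt_part_mask)
rand_err = random_outside_rate(gt_part_mask)
print(bsb_err, rand_err)  # 0.25 0.625
```

In a real evaluation these two rates would be averaged over many pairs before being compared.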

Figures

Figures reproduced from arXiv: 2605.18193 by Dale Decatur, Dongwei Lyu, Itai Lang, Rana Hanocka.

Figure 1: Best Segmentation Buddies computes segment-to-segment correspondence across different modalities (image-to-shape) and …
Figure 2: Image-shape correspondence gallery. BSB can match semantic parts when the object in the image and the 3D mesh are from different domains, where the corresponding elements differ substantially in appearance, shape, and size. In this work, we address this problem by proposing a segmentation-to-segmentation correspondence method across modalities and domains, matching 2D image regions to 3D semantic parts. Un…
Figure 4: Best Segmentation Buddies. Pixel to vertex similarity: we visualize the similarity from a clicked pixel feature (left) to the distilled vision features on the mesh (right) with a heatmap (red being most similar and blue being least similar). Vertex to pixel similarity: we visualize the similarity from the distilled feature of the mesh vertex (left) to all the features in the object image region (right). D…
Figure 5: Best Segmentation Buddies matching properties. When a correspondence between an image region and a mesh part exists (left and middle), the matched vertex will map back to a segment (bottom row) that is almost identical to the original segmentation (top row). However, if a match does not exist, such regions will differ substantially (right), implying the absence of correspondence. We discover this property…
Figure 6: Complete segment-to-segment correspondence. Our method is capable of generating a complete segmentation-to-segmentation correspondence between an image and a shape (left). We can also match corresponding segmentations across a variety of images of different types (sketch, photo, and drawing), poses, and appearances (right). … obtain the mask $M^{2D}_{q'}$, compute the Intersection over Union (IoU) with the mask o…
Figure 7: Shape to image correspondence. BSB is highly flexible and operates in both directions. In addition to matching an image segment to a 3D part, it can also match a 3D segmentation to the corresponding semantic image region. … $v$. In our work, we use the best segmentation buddy $v_p$ to segment the mesh. The resulting region $M^{3D}_{v_p}$ is regarded as the matching 3D part for the 2D segment $M^{2D}_p$ in the image, yielding …
Figure 9: Qualitative comparison. We adapt baselines to solve our task from existing techniques [3, 53]. These methods produce incorrect correspondences, whereas BSB reliably selects the shape part that semantically matches the target image segment. … complete segmentation of the mesh. Quantitative evaluation. As far as we can ascertain, there is no annotated dataset for cross-modality image-shape segment corresponde…
Figure 8: Local texturing. Our image-to-shape matching enables automatic, localized texturing of the shape driven by the texture in the image. Panel labels: NBB [3], DIFT [53], Ours.
Figure 10: Matching the same image to different shapes. Our method can match regions from images to different shapes that contain significant differences in geometric structures (top) and across occlusions in orientation from the query image (bottom). Tab. 1 presents the matching success rate averaged over the evaluation image-shape pairs. NBB relies on a sparse set of mutual nearest neighbor pixels in the neural fe…
Figure 12: Texture robustness. BSB matches semantic regions between image and shape despite variations in their appearance and texture. In …
Figure 13: Interactive correspondence. Our BSB matching between pixel clicks and mesh vertices, combined with interactive 2D and 3D segmentation, enables to dynamically update the cross-modality correspondence
Figure 14: Matching between differently posed objects. BSB finds correspondence between the image and shape when the object in each modality differs substantially. … the couch pair, …
Figure 15: Multi-region correspondence. BSB can match multiple regions between the same image-shape pair, when the modalities depict different objects (left), or distinguishing between similar parts of the objects and matching them correctly (right). Correspondence stability …
Figure 16: A single view to a complete 3D part. Although each image depicts only one view of the object (left), the entire corresponding part is successfully segmented in 3D (right). … applied language-driven image segmentation by predicting a bounding box for an object part described by text, and segmenting the part within the bounding box [48]. Then, we used that part mask and its centroid as the pixel click with our…
Figure 17: Correspondence stability. Our method is robust to the location of the pixel click in the image region (left). Although different pixels are matched to different vertices, they fall within the corresponding semantic 3D part (right), resulting in a stable matching between the image and the shape. The clicked pixel and the matched vertices are visualized with a green dot
Figure 18: Different images to the same shape. BSB accurately matches segmentations from images that contain significant differences in geometry (e.g., the heart-handle on the left) and appearance (e.g., crochet hat on the right) to the same shape. … semantic region of the shape, the 3D and 2D segments may not match
Figure 20: Backbone model versatility. BSB can utilize a vision backbone other than DINOv2 [43]. In this case, we lift a diffusion model features [53] to the 3D mesh for finding correspondences. The text prompts used to extract features for the images and the renderings of the shape are indicated next to them
Figure 21: Text-based 3D segmentation. Combining language-driven image segmentation with our method, we achieve 3D segmentation with text. The prompt above the image was used for its segmentation. … at random, and rendered the shape from a set of views, with elevation of {−60°, −30°, 0°, 30°, 60°}, and azimuth of {0°, 30°, ..., 330°}, a total of 5 · 12 = 60 possible views. We randomly selected two of these views…
Figure 22: Missing shape part. If a segmented region in the image is missing a matching part in the shape, our method will output an empty 3D segmentation, indicating correctly that correspondence does not exist in this case. … the feature space as the match to the pixel click. This baseline achieved a success rate of 0.73. We note that since no existing dataset provides ground-truth annotations for image-shape corr…
Figure 24: Correspondence comparison on PartNet. We show the generation process of the input image and 2D click (first three columns), the matched pixel by NBB and DIFT from the generated image to the rendered image of the shape and its unprojection to 3D (fourth to eighth columns), our matching vertex (ninth column) for the pixel click on the generated image, and the ground-truth shape region (tenth column) from wh…
Figure 25
Figure 27: Nearest neighbor vertex selection. Selecting the nearest neighbor vertex for a pixel click in the image leads to erroneous correspondences. In contrast, our BSB overcomes the image-shape modality gap and finds correct matches. Effectiveness ↑ by method: NBB 2.75, DIFT 2.74, NN Baseline 3.26, BSB (ours) 4.63.
Figure 28: Different number of vertex candidates. We evaluate the matching success rate on PartNet for different values of vertex candidates. The performance starts to increase with a higher number of candidates and then saturates. … corresponding vertex. F. Implementation Details. Vision model distillation. We train a multi-layer perceptron (MLP) to map each mesh vertex to a DINOv2-like feature vector of size $d_{vis}$ …
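The captions of Figures 5 and 6 describe checking whether a matched vertex maps back to a 2D segment that closely overlaps the original one, with overlap measured by Intersection over Union (IoU). A minimal sketch of such a check, assuming boolean masks of equal shape; the function names and the 0.5 threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over Union of two boolean masks with the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(a, b).sum() / union)

def has_correspondence(original_mask, mapped_back_mask, threshold=0.5):
    """Accept a match only if the mapped-back segment overlaps the original.

    The threshold value is an illustrative assumption.
    """
    return mask_iou(original_mask, mapped_back_mask) >= threshold

original = np.zeros((4, 4), dtype=bool)
original[0:2, 0:2] = True   # 4 pixels
close = np.zeros((4, 4), dtype=bool)
close[0:2, 0:3] = True      # 6 pixels, contains the original segment
far = np.zeros((4, 4), dtype=bool)
far[2:4, 2:4] = True        # disjoint from the original segment

iou_close = mask_iou(original, close)  # 4 / 6
iou_far = mask_iou(original, far)      # 0 / 8
print(has_correspondence(original, close), has_correspondence(original, far))
```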
Original abstract

Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a pipeline for segmentation-to-segmentation correspondence between in-the-wild 2D images and untextured 3D shapes. It distills features from a pretrained 2D vision model onto the 3D surface, defines Best Segmentation Buddies as the 3D vertices whose nearest image pixel (by distilled feature distance) lies inside a given 2D segment, and uses the resulting correspondences to bootstrap direct 3D segmentation from the image segment. The authors claim the method produces accurate, semantically meaningful matches across large differences in appearance, geometry, and viewpoint.

Significance. If the central claim holds, the work would provide a practical bridge for cross-modal semantic correspondence without requiring texture or dense alignment, which is useful for graphics and vision applications involving untextured meshes. The distillation-plus-nearest-neighbor formulation is conceptually simple and leverages existing 2D models, but its value rests on whether the distilled features actually confer the claimed semantic invariance.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the claim that Best Segmentation Buddies 'reliably discover vertices in semantically corresponding shape parts' is load-bearing, yet the manuscript supplies no quantitative metrics, success rates, or error analysis on any dataset to show that nearest-neighbor matches exceed a viewpoint/geometry baseline.
  2. [§4] §4 (experiments): no ablation is reported that isolates the distillation step from simply projecting raw 2D features onto the 3D surface; without this comparison it is impossible to verify that the nearest-pixel relation is driven by semantic part identity rather than residual viewpoint or surface-normal effects.
minor comments (2)
  1. [§3] The notation for feature similarity and the exact distillation procedure (e.g., which layers are used, how projection is performed) could be stated more explicitly with a short equation or pseudocode.
  2. [Figures] Figure captions and the project-page reference should include the specific image-shape pairs and ground-truth segments used for qualitative demonstration.
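As an illustration of the kind of short equation the first minor comment asks for, the selection rule could be written as follows. The notation is assumed here and is not the paper's: $f_p$ is the image feature at pixel $p$, $g_v$ the distilled feature at vertex $v$, $\Omega$ the set of object pixels, $\mathcal{V}$ the vertex set, and $M^{2D} \subseteq \Omega$ the image segment. Cosine similarity is an assumed choice of metric.

```latex
s(p, v) = \frac{f_p^{\top} g_v}{\lVert f_p \rVert \, \lVert g_v \rVert},
\qquad
\pi(v) = \arg\max_{p \in \Omega} s(p, v),
\qquad
\mathrm{BSB}\!\left(M^{2D}\right) = \left\{\, v \in \mathcal{V} \;:\; \pi(v) \in M^{2D} \,\right\}
```

Here $\pi(v)$ is the most similar pixel for vertex $v$, and a vertex is a Best Segmentation Buddy exactly when that pixel lies inside the segment.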

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional quantitative evaluation and ablation studies as suggested.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the claim that Best Segmentation Buddies 'reliably discover vertices in semantically corresponding shape parts' is load-bearing, yet the manuscript supplies no quantitative metrics, success rates, or error analysis on any dataset to show that nearest-neighbor matches exceed a viewpoint/geometry baseline.

    Authors: We agree that the load-bearing claim would benefit from quantitative support. The current manuscript focuses on qualitative demonstrations across diverse in-the-wild image-shape pairs to show semantic correspondence. In the revision we will add quantitative metrics, including precision/recall for vertex-to-segment matching and error analysis on a test set of image-shape pairs with ground-truth annotations, with explicit comparison to a viewpoint/geometry baseline that omits distilled features. revision: yes

  2. Referee: [§4] §4 (experiments): no ablation is reported that isolates the distillation step from simply projecting raw 2D features onto the 3D surface; without this comparison it is impossible to verify that the nearest-pixel relation is driven by semantic part identity rather than residual viewpoint or surface-normal effects.

    Authors: We concur that isolating the distillation step is necessary to confirm its role in semantic invariance. The revised manuscript will include an ablation that directly compares the full pipeline (with distilled features) against a variant that projects raw 2D features onto the 3D surface without distillation, measuring the impact on nearest-neighbor correspondence accuracy and semantic consistency. revision: yes
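The precision and recall promised in the first response could be computed per image-shape pair roughly as below. This is a sketch under the assumption that both the predicted 3D part and the ground-truth part are boolean per-vertex masks; the function name and the toy masks are hypothetical.

```python
import numpy as np

def vertex_precision_recall(pred_mask, gt_mask):
    """Precision and recall of a predicted per-vertex part mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()  # vertices correctly labeled as part
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return float(precision), float(recall)

# Toy masks over 6 vertices: 3 predicted, 4 in the ground truth, 2 shared.
pred = np.array([True, True, True, False, False, False])
gt = np.array([False, True, True, True, True, False])

precision, recall = vertex_precision_recall(pred, gt)
print(precision, recall)  # 2/3 and 1/2
```

Running the same function on the full pipeline and on the raw-projection variant from the second response would give a like-for-like ablation.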

Circularity Check

0 steps flagged

No circularity: pipeline uses external pretrained model and explicit definitions without self-referential reductions.

full rationale

The paper presents a methodological pipeline: distill features from an external 2D vision model onto 3D surfaces, then define Best Segmentation Buddies via nearest-neighbor feature similarity within given 2D segments, and bootstrap 3D segmentation. No equations, fitted parameters, or self-citations are shown that would make the discovered correspondences equivalent to inputs by construction. The approach depends on independent external components and is not a closed derivation that reduces outputs to renamed inputs or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the transferability of 2D visual features to 3D geometry and on the assumption that nearest-neighbor lookup in feature space respects semantic boundaries.

axioms (1)
  • domain assumption: Deep visual features extracted by a pretrained 2D vision model remain semantically meaningful when transferred to vertices of an untextured 3D mesh.
    Invoked when the paper states that distilling features onto the shape surface enables similarity computation between pixels and vertices.
invented entities (1)
  • Best Segmentation Buddies (no independent evidence)
    purpose: Vertices on the 3D shape whose nearest image pixel under distilled features lies inside the given 2D segment.
    New term and selection rule introduced to filter correspondences.

pith-pipeline@v0.9.0 · 5746 in / 1422 out tokens · 44393 ms · 2026-05-20T11:07:23.602509+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 4 internal anchors

  1. [1] Ahmed Abdelreheem, Abdelrahman Eldesokey, Maks Ovsjanikov, and Peter Wonka. Zero-Shot 3D Shape Correspondence. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, New York, NY, USA, 2023. Association for Computing Machinery.

  2. [2] Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, and Peter Wonka. SATR: Zero-Shot Semantic Segmentation of 3D Shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  3. [3] Kfir Aberman, Jing Liao, Mingyi Shi, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Neural Best-Buddies: Sparse Cross-Domain Correspondence. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.

  4. [4] Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3689–3699, 2024.

  5. [5] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning Class-Specific Descriptors for Deformable Shapes Using Localized Spectral Convolutional Networks. Computer Graphics Forum, 34(5):13–23, 2015.

  6. [6] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning Shape Correspondence with Anisotropic Convolutional Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.

  7. [7] Michael Calonder, Vincent Lepetit, Christophe Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features. In European Conference on Computer Vision (ECCV), pages 778–792. Springer, 2010.

  8. [8] Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha Chaudhuri, and Hao Zhang. BAE-NET: Branched Autoencoder for Shape Co-Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8490–8499, 2019.

  9. [9] Dale Decatur, Itai Lang, and Rana Hanocka. 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20930–20939,

  10. [10] Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka. 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4473–4483, 2024.

  11. [11] Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka. 3D PixBrush: Image-Guided Local Texture Synthesis. arXiv preprint arXiv:2507.03731, 2025.

  12. [12] Jiacheng Deng, Jiahao Lu, and Tianzhu Zhang. Unsupervised Template-assisted Point Cloud Shape Correspondence Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5250–5259, 2024.

  13. [13] Nicolas Donati, Abhishek Sharma, and Maks Ovsjanikov. Deep Geometric Functional Maps: Robust Feature Learning for Shape Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8589–8598, 2020.

  14. [14] Patrick Ebel, Anastasiia Mishchuk, Kwang Moo Yi, Pascal Fua, and Eduard Trulls. Beyond Cartesian Representations for Local Descriptors. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 253–262, 2019.

  15. [15] Marvin Eisenberger, Aysim Toker, Laura Leal-Taixé, and Daniel Cremers. Deep Shells: Unsupervised Shape Correspondence with Optimal Transport. In Advances in Neural Information Processing Systems, pages 10491–10502. Curran Associates, Inc., 2020.

  16. [16] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  17. [17] Oshri Halimi, Or Litany, Emanuele Rodola, Alex M. Bronstein, and Ron Kimmel. Unsupervised Learning of Dense Shape Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  18. [18] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. MeshCNN: A Network with an Edge. ACM Transactions on Graphics, 38(4):1–12,

  19. [19] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised Semantic Correspondence Using Stable Diffusion. In Advances in Neural Information Processing Systems, pages 8266–8279. Curran Associates, Inc., 2023.

  20. [20] Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, and Kwang Moo Yi. COTR: Correspondence Transformer for Matching Across Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6207–6217, 2021.

  21. [21] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Trans. Graph., 42(4):139–1, 2023.

  22. [22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.

  23. [23] Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  24. [24] Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical Surface Mapping via Geometric Cycle Consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2202–2211,

  25. [25] Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, and Shubham Tulsiani. Articulation-Aware Canonical Surface Mapping. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  26. [26] Itai Lang, Dvir Ginzburg, Shai Avidan, and Dan Raviv. DPC: Unsupervised Deep Point Correspondence via Cross and Self Construction. In Proceedings of the International Conference on 3D Vision (3DV), pages 1442–1451, 2021.

  27. [27] Itai Lang, Fei Xu, Dale Decatur, Sudarshan Babu, and Rana Hanocka. iSeg: Interactive 3D Segmentation via Interactive Attention. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11. Association for Computing Machinery, 2024.

  28. [28] Lei Li, Souhaib Attaiki, and Maks Ovsjanikov. SRFeat: Learning Locally Accurate and Globally Consistent Non-Rigid Shape Correspondence. In 2022 International Conference on 3D Vision (3DV), pages 144–154, 2022.

  29. [29] Or Litany, Tal Remez, Emanuele Rodolà, Alex M. Bronstein, and Michael M. Bronstein. Deep Functional Maps: Structured Prediction for Dense Shape Correspondence. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5660–5668. IEEE Computer Society,

  30. [30] Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding. In Advances in Neural Information Processing Systems, pages 44860–44879. Curran Associates, Inc., 2023.

  31. [31] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21736–21746, 2023.

  32. [32] David G Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

  33. [33] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence. In Advances in Neural Information Processing Systems, 2023.

  34. [34] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Geodesic Convolutional Neural Networks on Riemannian Manifolds. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, pages 37–45, 2015.

  35. [35] Simone Melzi, Riccardo Marin, Emanuele Rodolà, Umberto Castellani, Jing Ren, Adrien Poulenard, Peter Wonka, and Maks Ovsjanikov. SHREC 2019: Matching Humans with Different Connectivity. In Eurographics Workshop on 3D Object Retrieval, page 3. The Eurographics Association,

  36. [36] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405–421, 2020.

  37. [37]

    Working hard to know your neighbor’s margins: Local descriptor learning loss

    Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Advances in Neural Information Processing Systems, pages 4826–4837. Curran Associates, Inc., 2017. 3

  38. [38]

    Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

    Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Andrew D. Bagdanov. Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion. arXiv preprint arXiv:2502.04263, 2025. 3

  39. [39]

    PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding

    Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 909–918, 2019. 6, 7, 16

  40. [40]

    Continuous Surface Embeddings

    Natalia Neverova, David Novotny, Marc Szafraniec, Vasil Khalidov, Patrick Labatut, and Andrea Vedaldi. Continuous Surface Embeddings. In Advances in Neural Information Processing Systems, pages 17258–17270. Curran Associates, Inc., 2020. 1, 4, 7

  41. [41]

    Discovering Relationships between Object Categories via Universal Canonical Maps

    Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut, David Novotny, and Andrea Vedaldi. Discovering Relationships between Object Categories via Universal Canonical Maps. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 404–413, Los Alamitos, CA, USA, 2021. IEEE Computer Society. 4

  42. [43]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, ...

  43. [44]

    Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks

    Despoina Paschalidou, Angelos Katharopoulos, Andreas Geiger, and Sanja Fidler. Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4521–4530,

  44. [45]

    Automatic Differentiation in PyTorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In NIPS-W, 2017. 6

  45. [46]

    ASIA: Adaptive 3D Segmentation using Few Image Annotations

    Sai Raj Kishore Perla, Aditya Vora, Sauradip Nag, Ali Mahdavi-Amiri, and Hao Zhang. ASIA: Adaptive 3D Segmentation using Few Image Annotations. SIGGRAPH Asia Conference Papers, 2025. 7

  46. [47]

    PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017. 3

  47. [48]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:...

  48. [49]

    Toys4K 3D Object Dataset, 2022

    James Matthew Rehg. Toys4K 3D Object Dataset, 2022. https://github.com/rehg-lab/lowshot-shapebias/tree/main/toys4k. 6

  49. [50]

    Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J. Guibas. HuMoR: 3D Human Motion Model for Robust Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11488–11499, 2021. 4

  50. [51]

    ExtrudeNet: Unsupervised Inverse Sketch-and-Extrude for Shape Parsing

    Daxuan Ren, Jianmin Zheng, Jianfei Cai, Jiatong Li, and Junzhe Zhang. ExtrudeNet: Unsupervised Inverse Sketch-and-Extrude for Shape Parsing. In Proceedings of the 17th European Conference on Computer Vision (ECCV). Springer, 2022. 3

  51. [52]

    SHIC: Shape-Image Correspondences with no Keypoint Supervision

    Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. SHIC: Shape-Image Correspondences with no Keypoint Supervision. In European Conference on Computer Vision, pages 129–145. Springer, 2024. 1, 2, 4, 7

  52. [53]

    Emergent Correspondence from Image Diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent Correspondence from Image Diffusion. In Advances in Neural Information Processing Systems, 2023. 3, 6, 7, 15, 16, 17, 18, 20

  53. [54]

    SOSNet: Second Order Similarity Regularization for Local Descriptor Learning

    Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. SOSNet: Second Order Similarity Regularization for Local Descriptor Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11016–11025, 2019. 3

  54. [55]

    TurboSquid 3D Model Repository, 2021

    TurboSquid. TurboSquid 3D Model Repository, 2021. https://www.turbosquid.com/. 6

  55. [56]

    Prior Knowledge for Part Correspondence

    Oliver van Kaick, Andrea Tagliasacchi, Oana Sidi, Hao Zhang, Daniel Cohen-Or, Lior Wolf, and Ghassan Hamarneh. Prior Knowledge for Part Correspondence. Computer Graphics Forum, 30(2):553–562, 2011. 6

  56. [57]

    SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

    Feng Wang, Jieru Mei, and Alan Yuille. SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference. arXiv preprint arXiv:2312.01597, 2024. 3

  57. [58]

    Diffusion Model is Secretly a Training-Free Open Vocabulary Semantic Segmenter

    Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion Model is Secretly a Training-Free Open Vocabulary Semantic Segmenter. IEEE Transactions on Image Processing, 34:1895–1907, 2025. 3

  58. [59]

    SegGPT: Towards Segmenting Everything in Context

    Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. SegGPT: Towards Segmenting Everything in Context. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1130–1140, 2023. 2

  59. [60]

    Dynamic Graph CNN for Learning on Point Clouds

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph., 38(5), 2019. 3

  60. [61]

    Dense Human Body Correspondences Using Convolutional Networks

    Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, and Hao Li. Dense Human Body Correspondences Using Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1544–1553, 2016. 3

  61. [62]

    3D ShapeNets: A Deep Representation for Volumetric Shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015. 3

  62. [63]

    LIFT: Learned Invariant Feature Transform

    Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In European Conference on Computer Vision (ECCV), pages 467–

  63. [64]

    Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling

    Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19313–19322. IEEE, 2022. 3

  64. [65]

    PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop

    Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11426–11436, 2021. 4

  65. [66]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv preprint arXiv:2203.03605,

  66. [67]

    A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence. In Advances in Neural Information Processing Systems, 2023. 15

  67. [68]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023. 7, 17

  68. [69]

    Extract Free Dense Labels from CLIP

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract Free Dense Labels from CLIP. In Proceedings of the 17th European Conference on Computer Vision (ECCV), pages 696–712, Cham, 2022. Springer Nature Switzerland. 3

  69. [70]

    Thingi10K: A Dataset of 10,000 3D-Printing Models

    Qingnan Zhou and Alec Jacobson. Thingi10K: A Dataset of 10,000 3D-Printing Models. arXiv preprint arXiv:1605.04797, 2016. 6

  70. [71]

    Segment Everything Everywhere All at Once

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment Everything Everywhere All at Once. In Advances in Neural Information Processing Systems, pages 19769–19782. Curran Associates, Inc., 2023. 2

    Best Segmentation Buddies for Image-Shape Correspondence Supplementary Material The followi...

  71. [72]

    An image of an airplane facing away

    with a box input, where the user specifies the top-left and bottom-right coordinates in the image to segment the part mask $m^{2D}_p$ used in our matching scheme. Examples are shown in Fig. 19. Another interface for segmenting the image is text, as we describe next. Text to 3D segmentation. In the main paper, we used a click-based model for segmenting the image...