Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Adam Kortylewski; Artur Jesslen; Olaf Dünkel

arxiv: 2605.30093 · v1 · pith:4Q3FHWZHnew · submitted 2026-05-28 · 💻 cs.CV

Pith reviewed 2026-06-29 08:25 UTC · model grok-4.3

keywords: semantic correspondence · 3D foundation models · SAM3D · geodesic distance filtering · DINO features · Stable Diffusion features · post-training adapter · render-and-compare

The pith

3D geometry estimates from SAM3D and geodesic filtering on reconstructed meshes supply reliable supervision that lets a lightweight adapter improve semantic correspondence on top of DINO and Stable Diffusion features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that 2D foundation features alone are insufficient for semantic correspondence because they confuse symmetric sides and repeated parts that are distinct in three dimensions. It shows that adding instance-specific 3D structure obtained via SAM3D, refined by render-and-compare, and used both for rendered PartField descriptors and for geodesic-distance filtering produces better training signals than prior methods that rely on coarse spherical geometry or manual pose labels. A sympathetic reader would care because semantic correspondence underpins many downstream tasks such as object tracking and 3D reconstruction, and the method reduces the amount of manual geometric supervision required. The core mechanism is automatic extraction of per-instance 3D priors that complement existing 2D features without changing the underlying foundation models.

Core claim

Given an image, SAM3D estimates object geometry and pose; render-and-compare optimization refines the pose; PartField descriptors are rendered into the image plane; geodesic distances on the mesh filter candidate correspondences; and the filtered matches supervise a lightweight adapter on DINO and Stable Diffusion features. This pipeline yields higher semantic correspondence accuracy than prior post-training methods while eliminating the need for pose annotations and coarse spherical geometry.

What carries the argument

The 3D-aware post-training framework that renders PartField descriptors from SAM3D-reconstructed meshes and uses geodesic distances to filter correspondences for adapter supervision.
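The geodesic filtering idea can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the paper: both matched pixels are assumed to be already lifted to vertices of one shared mesh, shortest paths along mesh edges (Dijkstra) stand in for true surface geodesics, and the threshold `tau` and all function names are hypothetical.

```python
import heapq
import math

def edge_graph(vertices, faces):
    """Build {vertex: {neighbor: edge_length}} from a triangle list."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            w = math.dist(vertices[u], vertices[v])
            graph[u][v] = w
            graph[v][u] = w
    return graph

def geodesic_from(graph, source):
    """Dijkstra shortest-path lengths along mesh edges from one vertex."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def filter_matches(graph, candidates, tau):
    """Keep (predicted_vertex, reference_vertex) pairs with geodesic error <= tau."""
    cache = {}
    kept = []
    for pred, ref in candidates:
        if ref not in cache:
            cache[ref] = geodesic_from(graph, ref)
        if cache[ref].get(pred, math.inf) <= tau:
            kept.append((pred, ref))
    return kept

# Toy mesh: a sheet folded back on itself, so that two vertices are close
# in space but far apart along the surface (a stand-in for symmetric sides).
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.1), (0.0, 0.1)]  # (x, z) along the fold
vertices = [(x, y, z) for (x, z) in path for y in (0.0, 1.0)]
faces = []
for i in range(len(path) - 1):
    faces.append((2 * i, 2 * i + 2, 2 * i + 1))
    faces.append((2 * i + 1, 2 * i + 2, 2 * i + 3))
graph = edge_graph(vertices, faces)

# Vertex 2 is a near miss of reference vertex 0; vertex 6 sits on the opposite layer.
kept = filter_matches(graph, [(2, 0), (6, 0)], tau=1.5)
```

On this toy mesh, vertices 0 and 6 are 0.1 apart in Euclidean distance but about 2.1 apart along the surface, so a Euclidean check would accept the pair while the geodesic check rejects it.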

If this is right

Correspondence learning becomes robust to symmetry and repeated parts without explicit 3D annotations.
Training no longer requires manual pose labels or spherical geometry assumptions.
The same filtered matches can be reused to adapt other 2D foundation features beyond DINO and Stable Diffusion.
Instance-specific 3D structure guides matching even when objects appear in novel viewpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometry-aware filtering step could be applied to other correspondence-heavy tasks such as optical flow or keypoint matching.
If SAM3D improves over time, the adapter training signal strengthens automatically without changes to the rest of the pipeline.
The approach may reduce the performance gap between 2D and full 3D correspondence methods on datasets with strong symmetries.

Load-bearing premise

SAM3D must produce geometry and pose estimates accurate enough for render-and-compare optimization and geodesic filtering to generate trustworthy supervision signals.

What would settle it

A controlled test in which SAM3D geometry estimates are deliberately degraded or replaced by random poses, after which the reported gains in semantic correspondence accuracy disappear or reverse.

Figures

Figures reproduced from arXiv: 2605.30093 by Adam Kortylewski, Artur Jesslen, Olaf Dünkel.

**Figure 1.** 3D foundation priors improve both candidate generation and filtering of semantic correspondences. Existing zero-shot pipelines based on SD+DINO (a) suffer from left–right and repeated-part confusion, producing many incorrect matches. Adding our geodesic filter (b) removes wrong matches but is bottlenecked by feature quality, often leaving few surviving correspondences. Adding PartField features (c) yields …

**Figure 2.** Canonicalized 3D object reconstruction pipeline. Given an image, we obtain an instance mask and a mesh from foundation models. We then refine the mesh pose via a two-phase render-and-compare optimization based on a distance-transform (DT) and a soft-IoU phase. Finally, we resolve the residual four-fold yaw ambiguity by rendering the mesh at eight known orientations and applying OrientAnything V2 with major…

**Figure 3.** Pseudo-label correspondences pipeline. Given two images, we fuse DINO, SD, and PartField features (rasterized from the meshes of Section 3.1) and propose candidate matches via nearest-neighbor (NN) search with relaxed cyclic consistency (c.c.). Each candidate is then geometrically verified by lifting the matched pixels onto the reconstructed meshes and computing the geodesic error $d^{s \leftrightarrow t}_{\mathrm{geo}}$; candidates exc…

**Figure 4.** Qualitative pseudo-annotations. We visualize pseudo-ground-truth annotations from 3D-SC and DIY-SC. 3D-SC produces denser and more geometrically consistent pseudo-annotations. and Weakly supervised approaches. DIFT [23], and SD + DINOv2 [43] extract features from foundation models and perform nearest-neighbor matching in feature space. Spherical mapper [24] and DIY-SC [8] both leverage pose annotations as…
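The candidate-proposal step named in the Figure 3 caption (nearest-neighbor search with relaxed cyclic consistency) can be sketched as follows. The captions do not define the relaxation, so this sketch assumes one common reading: a forward match is kept if the backward nearest neighbor lands within a pixel `radius` of the starting point, with `radius = 0` recovering strict mutual nearest neighbors. The feature vectors, function names, and `radius` are illustrative, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def nearest(query, bank):
    """Index of the feature in `bank` most similar to `query`."""
    return max(range(len(bank)), key=lambda j: cosine(query, bank[j]))

def propose_matches(src_feats, src_xy, tgt_feats, radius):
    """Forward NN matches whose backward NN returns near the start pixel."""
    matches = []
    for i, feat in enumerate(src_feats):
        j = nearest(feat, tgt_feats)               # source -> target
        i_back = nearest(tgt_feats[j], src_feats)  # target -> source
        if math.dist(src_xy[i], src_xy[i_back]) <= radius:
            matches.append((i, j))
    return matches

# Toy data: source points 0 and 1 are neighbors with similar features.
src_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
src_xy = [(0, 0), (1, 0), (10, 0)]
tgt_feats = [[1.0, 0.02], [0.0, 1.0]]

strict = propose_matches(src_feats, src_xy, tgt_feats, radius=0.0)
relaxed = propose_matches(src_feats, src_xy, tgt_feats, radius=1.5)
```

With the strict check, source point 1 is dropped because its cycle returns to point 0; the relaxed check keeps it because the cycle ends one pixel away. This is the trade-off the relaxation makes: more candidates survive, which leaves more work for the later geometric verification.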

Original abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over the prior methods while reducing manual geometric supervision. Code and model can be found at https:/github.com/GenIntel/3D-SC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper outlines a pipeline using SAM3D geometry and geodesic filtering to supervise an adapter on DINO and diffusion features, but the abstract supplies no numbers or ablations to check if it actually works.

The main takeaway is that this paper describes a post-training method to add 3D structure to semantic correspondence. It runs SAM3D to recover per-instance meshes and poses, refines the pose with render-and-compare, renders PartField descriptors into the image, and uses geodesic distances on the mesh to filter candidate matches. Those filtered pairs then supervise a lightweight adapter on top of DINO and Stable Diffusion features.

What stands out as new is the concrete combination of SAM3D-driven reconstruction, PartField projection, and geodesic filtering to generate supervision without manual pose labels. Earlier post-training work often needed coarse spherical geometry or explicit annotations; this approach tries to get instance-specific 3D structure automatically.

The description of the pipeline is clear and the motivation for handling symmetric parts and repeated structures makes sense on paper. The method also avoids some of the manual geometric supervision that prior approaches required.

The soft spot is the complete absence of quantitative results, datasets, error bars, or ablations in the text provided. The claim that the approach improves correspondence and reduces manual supervision is stated but cannot be checked. The stress-test point about SAM3D accuracy is on target: if the reconstructed meshes or poses are off, the rendered PartField maps will misalign and the geodesic filter will pass noisy pairs as supervision. Without numbers on reconstruction quality or an ablation on that step, it is impossible to know whether the claimed gains are real.

The paper is aimed at computer-vision researchers who work on semantic matching and foundation-model adaptation for correspondence tasks. Someone already running DINO or diffusion features on matching benchmarks might pick up the pipeline idea, but the lack of evidence makes it hard to judge practical value.

I would not recommend sending this for peer review until the full results section is available and shows clear, reproducible gains with proper controls.

Referee Report

3 major / 1 minor

Summary. The paper introduces a 3D-aware post-training framework for semantic correspondence that leverages SAM3D to recover per-instance geometry and pose (refined via render-and-compare), renders PartField descriptors into the image plane, and applies geodesic distances on the reconstructed mesh to filter candidate matches. These filtered pairs serve as supervision to train a lightweight adapter atop DINO and Stable Diffusion features. The central claim is that the resulting geometry-aware features improve semantic correspondence accuracy over prior 2D-only post-training methods while reducing the need for manual pose annotations or coarse spherical geometry.

Significance. If the empirical claims hold, the work would demonstrate a practical route to injecting instance-specific 3D structure into existing 2D foundation features, addressing well-known failure modes on symmetric and repeated-part objects. The automatic acquisition of 3D priors and the public release of code and models are concrete strengths that would facilitate follow-up work.

major comments (3)

[Abstract / Method] Abstract and method description: the improvement claim rests on the premise that SAM3D plus render-and-compare yields geometry and pose accurate enough for geodesic filtering to retain true semantic matches and discard false positives; however, no quantitative SAM3D reconstruction or pose-error statistics are reported on the correspondence evaluation sets, leaving the reliability of the supervision signal unverified.
[Abstract] Abstract: the statement that 'Experiments show that our approach improves semantic correspondence over the prior methods' is unsupported by any tables, figures, metrics, error bars, dataset descriptions, or ablation results in the manuscript text, rendering the central empirical claim unverifiable from the provided content.
[Method] Method: because the pipeline composes multiple external foundation models (SAM3D, PartField, DINO, Stable Diffusion) whose outputs are used directly as supervision, the absence of an ablation isolating the contribution of the geodesic-filtered 3D component versus the base 2D features makes it impossible to attribute gains specifically to the 3D priors.

minor comments (1)

[Abstract] The GitHub URL in the abstract is written as 'https:/github.com/GenIntel/3D-SC' (single slash after https:); this should be corrected to the standard 'https://' form.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical grounding of our 3D-aware post-training framework. We address each major comment below and commit to revisions that strengthen the verification of the supervision signal and the attribution of gains.

Referee: [Abstract / Method] Abstract and method description: the improvement claim rests on the premise that SAM3D plus render-and-compare yields geometry and pose accurate enough for geodesic filtering to retain true semantic matches and discard false positives; however, no quantitative SAM3D reconstruction or pose-error statistics are reported on the correspondence evaluation sets, leaving the reliability of the supervision signal unverified.

Authors: We agree that quantitative validation of SAM3D reconstruction and pose accuracy on the exact correspondence benchmarks is needed to confirm the supervision quality. In the revision we will report mean rotation/translation errors, reconstruction IoU, and render-and-compare convergence statistics computed directly on the SPair-71k and PF-PASCAL evaluation images. revision: yes
Referee: [Abstract] Abstract: the statement that 'Experiments show that our approach improves semantic correspondence over the prior methods' is unsupported by any tables, figures, metrics, error bars, dataset descriptions, or ablation results in the manuscript text, rendering the central empirical claim unverifiable from the provided content.

Authors: The Experiments section of the full manuscript presents the supporting tables, figures, and dataset details. To make the abstract claim immediately verifiable, we will insert explicit cross-references (e.g., “see Table 2”) and a one-sentence summary of key metrics within the abstract itself. revision: partial
Referee: [Method] Method: because the pipeline composes multiple external foundation models (SAM3D, PartField, DINO, Stable Diffusion) whose outputs are used directly as supervision, the absence of an ablation isolating the contribution of the geodesic-filtered 3D component versus the base 2D features makes it impossible to attribute gains specifically to the 3D priors.

Authors: We concur that an explicit ablation is required. The revised manuscript will include a controlled experiment training the adapter on (i) raw DINO+SD matches and (ii) the same matches after geodesic filtering on the reconstructed meshes, thereby isolating the contribution of the 3D component. revision: yes
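The pose and reconstruction statistics promised in the first response can be computed with standard metrics. The sketch below assumes reference rotations, translations, and masks are available (the text does not say how they would be obtained) and uses 2D mask IoU as a stand-in for whatever "reconstruction IoU" the authors intend. All function names are illustrative.

```python
import math

def rotation_error_deg(R_est, R_ref):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_ref) equals the elementwise (Frobenius) inner product.
    trace = sum(R_est[i][j] * R_ref[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_theta))

def translation_error(t_est, t_ref):
    """Euclidean distance between two translation vectors."""
    return math.dist(t_est, t_ref)

def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0

def yaw_matrix(deg):
    """Rotation about the z axis, used here only to build test poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# A pose that is off by one step of a four-fold yaw ambiguity.
err = rotation_error_deg(yaw_matrix(0), yaw_matrix(90))
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

A yaw error of one quarter turn shows up as a 90 degree rotation error, which is why the four-fold yaw ambiguity mentioned in the Figure 2 caption would be visible in such statistics if it were left unresolved.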

Circularity Check

0 steps flagged

No circularity: pipeline composes external models without self-referential reduction

The paper presents a composite pipeline that invokes external models (SAM3D for geometry/pose, PartField descriptors, DINO and Stable Diffusion features) to produce filtered correspondences used as supervision for a lightweight adapter. No equations appear in the abstract or description that define a quantity in terms of itself or rename a fitted input as a prediction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The derivation chain therefore remains self-contained against external benchmarks and does not reduce the claimed improvement to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on three domain assumptions about the reliability of existing 3D foundation models and surface-distance filtering; no free parameters or new invented entities are introduced in the abstract.

axioms (3)

domain assumption: SAM3D produces usable per-instance 3D geometry and pose from a single image. Invoked in the first processing step of the pipeline.
domain assumption: Render-and-compare optimization refines the initial pose estimate to sufficient accuracy. Required before PartField rendering can be performed.
domain assumption: Geodesic distances on the reconstructed mesh reliably separate correct from incorrect candidate matches. Used to filter supervision signals for the adapter.
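The second assumption rests on a render-and-compare objective, and the Figure 2 caption names a soft-IoU phase. A minimal sketch of such a score is given below, assuming the rendered silhouette and the instance mask are same-sized grids with values in [0, 1]. The brute-force search over horizontal shifts is a stand-in for the pose optimizer, whose details are not given in the text; `shift_right` and `best_shift` are hypothetical helpers.

```python
def soft_iou(rendered, target, eps=1e-8):
    """Soft IoU between two same-sized maps with values in [0, 1]."""
    inter = 0.0
    union = 0.0
    for row_r, row_t in zip(rendered, target):
        for r, t in zip(row_r, row_t):
            inter += r * t           # soft intersection
            union += r + t - r * t   # soft union
    return inter / (union + eps)

def shift_right(mask, k):
    """Shift columns right by k (left if k is negative), zero-filling."""
    width = len(mask[0])
    return [[row[x - k] if 0 <= x - k < width else 0.0 for x in range(width)]
            for row in mask]

def best_shift(rendered, target, max_shift):
    """Pick the horizontal shift that maximizes soft IoU with the target."""
    shifts = range(-max_shift, max_shift + 1)
    return max(shifts, key=lambda k: soft_iou(shift_right(rendered, k), target))

rendered = [[1.0, 1.0, 0.0, 0.0, 0.0]]
target = [[0.0, 0.0, 1.0, 1.0, 0.0]]
shift = best_shift(rendered, target, max_shift=3)
```

Because the score uses products and sums rather than hard thresholds, it stays meaningful for partially covered pixels, which is what makes it usable as an optimization objective (for example as `1 - soft_iou`).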

Reference graph

Works this paper leans on

46 extracted references · 8 canonical work pages · 3 internal anchors

[1] K. Aberman, J. Liao, M. Shi, D. Lischinski, B. Chen, and D. Cohen-Or. Neural best-buddies: Sparse cross-domain correspondence. ACM Transactions on Graphics (TOG), 2018.
[2] S. Amir, Y. Gandelsman, S. Bagon, and T. Dekel. Deep ViT features as dense visual descriptors. In European Conference on Computer Vision Workshop (ECCVW), 2022.
[3] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. Rädle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y. Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollár, N. Ravi, K. Sae... SAM 3: Segment Anything with Concepts. arXiv, 2025.
[4] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021.
[5] Y. Chi, L. Sommer, O. Dünkel, D. Muhle, D. Cremers, C. Theobalt, and A. Kortylewski. C3po: Canonicalization of 3d pose from partial views with generalizable correspondence features. In International Conference on 3D Vision (3DV), 2026.
[6] C. Cuttano, G. Trivigno, C. Masone, and S. Roth. MARCO: Navigating the unseen space of semantic correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[7] N. Donati, A. Sharma, and M. Ovsjanikov. Deep geometric functional maps: Robust feature learning for shape correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8592–8601, 2020.
[8] O. Dünkel, T. Wimmer, C. Theobalt, C. Rupprecht, and A. Kortylewski. Do it yourself: Learning semantic correspondence from pseudo-labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[9] N. S. Dutt, S. Muralikrishnan, and N. J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, 2024.
[10] F. Fundel, J. Schusterbauer, V. T. Hu, and B. Ommer. Distillation of diffusion features for semantic correspondence. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025.
[11] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow: Semantic correspondences from object proposals. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(7):1711–1725, 2017.
[12] R. Hartwig, D. Muhle, R. Marin, and D. Cremers. Geco: Geometrically consistent embedding with lightspeed inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9309–9319, 2025.
[13] E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, and K. M. Yi. Unsupervised semantic correspondence using stable diffusion. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
[14] Y. Huang, Y. Sun, C. Lai, Q. Xu, X. Wang, X. Shen, and W. Ge. Weakly supervised learning of semantic correspondence through cascaded online correspondence refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[15] A. Jesslen, G. Zhang, A. Wang, W. Ma, A. Yuille, and A. Kortylewski. Novum: Neural object volumes for robust object classification. In European Conference on Computer Vision (ECCV), 2024.
[16] J. Kim, K. Ryoo, J. Seo, G. Lee, D. Kim, H. Cho, and S. Kim. Semi-supervised learning of semantic correspondence with pseudo-labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[17] N. Kulkarni, S. Tulsiani, and A. Gupta. Canonical surface mapping via geometric cycle consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2202–2211, 2019. doi: 10.1109/ICCV.2019.00229.
[18] X. Li, D.-P. Fan, F. Yang, A. Luo, H. Cheng, and Z. Liu. Probabilistic model distillation for semantic correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[19] X. Li, J. Lu, K. Han, and V. A. Prisacariu. SD4Match: Learning to prompt Stable Diffusion model for semantic matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 27558–27568, 2024.
[20] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5):978–994, 2011.
[21] M. Liu, M. A. Uy, D. Xiang, H. Su, S. Fidler, N. Sharp, and J. Gao. PartField: Learning 3D feature fields for part segmentation and beyond. arXiv preprint arXiv:2504.11451, 2025.
[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
[23] G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
[24] O. Mariotti, O. Mac Aodha, and H. Bilen. Improving semantic correspondence with viewpoint-guided spherical maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[25] O. Mariotti, Z. Du, Y. Bhalgat, O. Mac Aodha, and H. Bilen. Jamais Vu: Exposing the generalization gap in supervised semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), volume 38, 2025.
[26] J. Min, J. Lee, J. Ponce, and M. Cho. Spair-71k: A large-scale benchmark for semantic correspondence, 2019. URL https://arxiv.org/abs/1908.10543
[27] N. Neverova, D. Novotny, V. Khalidov, M. Szafraniec, P. Labatut, and A. Vedaldi. Continuous surface embeddings for deformable shape correspondence. Conference on Neural Information Processing Systems (NeurIPS), 2020.
[28] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[29] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.
[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[31] SAM 3D Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Dollár, G. Gkioxari, M. Feiszli, and J. Malik. SAM 3D: 3Dfy anything in images, 2025. URL https://arxiv.org/abs/2511.16624
[32] A. Shtedritski, C. Rupprecht, and A. Vedaldi. Shic: Shape-image correspondences with no keypoint supervision. In European Conference on Computer Vision (ECCV), 2024.
[33] L. Sommer, O. Dünkel, C. Theobalt, and A. Kortylewski. Common3d: Self-supervised learning of 3d morphable models for common objects in neural feature space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6468–6479, June 2025.
[34] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
[35] T. Taniai, S. N. Sinha, and Y. Sato. Joint recovery of dense correspondence and cosegmentation in two images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4246–4255, 2016.
[36] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, 2023.
[37] K. Wandel and H. Wang. Semalign3d: Semantic correspondence between rgb-images through aligning 3d object-class representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1138–1147, 2025. doi: 10.1109/CVPR52734.2025.00114.
[38] P. Wang, T. Ikeda, R. Lee, and K. Nishiwaki. Gs-pose: Category-level object pose estimation via geometric and semantic correspondence. In European Conference on Computer Vision (ECCV), pages 108–126. Springer, 2024.
[39] Z. Wang, Z. Zhang, J. Xu, J. Wang, T. Pang, C. Du, H. Zhao, and Z. Zhao. Orient anything v2: Unifying orientation and rotation understanding. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
[40] F. Xue, S. Elflein, L. Leal-Taixe, and Q. Zhou. MATCHA: Towards matching anything. arXiv preprint arXiv:2501.14945, 2025.
[41] L. Yi, V. G. Kim, D. Ceylan, W. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. In ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2016.
[42] H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, and D. Tao. AP-10k: A benchmark for animal pose estimation in the wild. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
[43] J. Zhang, C. Herrmann, J. Hur, L. F. Polanía, V. Jampani, D. Sun, and M.-H. Yang. A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
[44] J. Zhang, C. Herrmann, J. Hur, E. Chen, V. Jampani, D. Sun, and M.-H. Yang. Telling left from right: Identifying geometry-aware semantic correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3076–3085, 2024.
[45] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[46] J. Zhu, Y. Ju, J. Zhang, M. Wang, Z. Yuan, K. Hu, and H. Xu. Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo. International Conference on Learning Representations (ICLR), 2025.

[1] [1]

Aberman, J

K. Aberman, J. Liao, M. Shi, D. Lischinski, B. Chen, and D. Cohen-Or. Neural best-buddies: Sparse cross-domain correspondence.ACM Transactions on Graphics (TOG), 2018

2018

[2] [2]

S. Amir, Y . Gandelsman, S. Bagon, and T. Dekel. Deep ViT features as dense visual descriptors. InEuropean Conference on Computer Vision Workshop (ECCVW), 2022

2022

[3] [3]

SAM 3: Segment Anything with Concepts

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. Rädle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollár, N. Ravi, K. Sae...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

2021

[5] [5]

Y . Chi, L. Sommer, O. Dünkel, D. Muhle, D. Cremers, C. Theobalt, and A. Kortylewski. C3po: Canonicalization of 3d pose from partial views with generalizable correspondence features. In International Conference on 3D Vision (3DV), 2026

2026

[6] [6]

Cuttano, G

C. Cuttano, G. Trivigno, C. Masone, and S. Roth. MARCO: Navigating the unseen space of semantic correspondence. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[7] [7]

Donati, A

N. Donati, A. Sharma, and M. Ovsjanikov. Deep geometric functional maps: Robust feature learning for shape correspondence. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8592–8601, 2020

2020

[8] [8]

Dünkel, T

O. Dünkel, T. Wimmer, C. Theobalt, C. Rupprecht, and A. Kortylewski. Do it yourself: Learning semantic correspondence from pseudo-labels. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[9] [9]

N. S. Dutt, S. Muralikrishnan, and N. J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, 2024

2024

[10] [10]

Fundel, J

F. Fundel, J. Schusterbauer, V . T. Hu, and B. Ommer. Distillation of diffusion features for semantic correspondence. InProceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025

2025

[11] [11]

B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow: Semantic correspondences from object proposals.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40 (7):1711–1725, 2017

2017

[12] [12]

Hartwig, D

R. Hartwig, D. Muhle, R. Marin, and D. Cremers. Geco: Geometrically consistent embedding with lightspeed inference. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9309–9319, 2025. 10

2025

[13] [13]

Hedlin, G

E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, and K. M. Yi. Unsuper- vised semantic correspondence using stable diffusion. InConference on Neural Information Processing Systems (NeurIPS), 2023

2023

[14] [14]

Huang, Y

Y . Huang, Y . Sun, C. Lai, Q. Xu, X. Wang, X. Shen, and W. Ge. Weakly supervised learning of semantic correspondence through cascaded online correspondence refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023

[15] A. Jesslen, G. Zhang, A. Wang, W. Ma, A. Yuille, and A. Kortylewski. Novum: Neural object volumes for robust object classification. In European Conference on Computer Vision (ECCV), 2024.

[16] J. Kim, K. Ryoo, J. Seo, G. Lee, D. Kim, H. Cho, and S. Kim. Semi-supervised learning of semantic correspondence with pseudo-labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[17] N. Kulkarni, S. Tulsiani, and A. Gupta. Canonical surface mapping via geometric cycle consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2202–2211, 2019. doi: 10.1109/ICCV.2019.00229.

[18] X. Li, D.-P. Fan, F. Yang, A. Luo, H. Cheng, and Z. Liu. Probabilistic model distillation for semantic correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[19] X. Li, J. Lu, K. Han, and V. A. Prisacariu. SD4Match: Learning to prompt Stable Diffusion model for semantic matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 27558–27568, 2024.

[20] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(5):978–994, 2011.

[21] M. Liu, M. A. Uy, D. Xiang, H. Su, S. Fidler, N. Sharp, and J. Gao. PartField: Learning 3D feature fields for part segmentation and beyond. arXiv preprint arXiv:2504.11451, 2025.

[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.

[23] G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), 2023.

[24] O. Mariotti, O. Mac Aodha, and H. Bilen. Improving semantic correspondence with viewpoint-guided spherical maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[25] O. Mariotti, Z. Du, Y. Bhalgat, O. Mac Aodha, and H. Bilen. Jamais Vu: Exposing the generalization gap in supervised semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), volume 38, 2025.

[26] J. Min, J. Lee, J. Ponce, and M. Cho. Spair-71k: A large-scale benchmark for semantic correspondence, 2019. URL https://arxiv.org/abs/1908.10543

[27] N. Neverova, D. Novotny, V. Khalidov, M. Szafraniec, P. Labatut, and A. Vedaldi. Continuous surface embeddings for deformable shape correspondence. Conference on Neural Information Processing Systems (NeurIPS), 2020.

[28] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[29] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.

[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.

[31] SAM 3D Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Dollár, G. Gkioxari, M. Feiszli, and J. Malik. SAM 3D: 3Dfy anything in images, 2025. URL https://arxiv.org/abs/2511.16624

[32] A. Shtedritski, C. Rupprecht, and A. Vedaldi. Shic: Shape-image correspondences with no keypoint supervision. In European Conference on Computer Vision (ECCV), 2024.

[33] L. Sommer, O. Dünkel, C. Theobalt, and A. Kortylewski. Common3d: Self-supervised learning of 3d morphable models for common objects in neural feature space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6468–6479, June 2025.

[34] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion. In Conference on Neural Information Processing Systems (NeurIPS), 2023.

[35] T. Taniai, S. N. Sinha, and Y. Sato. Joint recovery of dense correspondence and cosegmentation in two images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4246–4255, 2016.

[36] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, 2023.

[37] K. Wandel and H. Wang. Semalign3d: Semantic correspondence between rgb-images through aligning 3d object-class representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1138–1147, 2025. doi: 10.1109/CVPR52734.2025.00114.

[38] P. Wang, T. Ikeda, R. Lee, and K. Nishiwaki. Gs-pose: Category-level object pose estimation via geometric and semantic correspondence. In European Conference on Computer Vision (ECCV), pages 108–126. Springer, 2024.

[39] Z. Wang, Z. Zhang, J. Xu, J. Wang, T. Pang, C. Du, H. Zhao, and Z. Zhao. Orient anything v2: Unifying orientation and rotation understanding. In Conference on Neural Information Processing Systems (NeurIPS), 2025.

[40] F. Xue, S. Elflein, L. Leal-Taixe, and Q. Zhou. MATCHA: Towards matching anything. arXiv preprint arXiv:2501.14945, 2025.

[41] L. Yi, V. G. Kim, D. Ceylan, W. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas. A scalable active framework for region annotation in 3d shape collections. In ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2016.

[42] H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, and D. Tao. AP-10k: A benchmark for animal pose estimation in the wild. In Conference on Neural Information Processing Systems (NeurIPS), 2021.

[43] J. Zhang, C. Herrmann, J. Hur, L. F. Polanía, V. Jampani, D. Sun, and M.-H. Yang. A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. In Conference on Neural Information Processing Systems (NeurIPS), 2023.

[44] J. Zhang, C. Herrmann, J. Hur, E. Chen, V. Jampani, D. Sun, and M.-H. Yang. Telling left from right: Identifying geometry-aware semantic correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3076–3085, 2024.

[45] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[46] J. Zhu, Y. Ju, J. Zhang, M. Wang, Z. Yuan, K. Hu, and H. Xu. Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo. International Conference on Learning Representations (ICLR), 2025.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Supplementary Material

This supplement is organi...