pith. machine review for the scientific record.

arxiv: 2604.18267 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

MARCO: Navigating the Unseen Space of Semantic Correspondence


Pith reviewed 2026-05-10 05:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic correspondence · self-distillation · DINOv2 · keypoint matching · generalization · fine-grained localization · SPair-71k

The pith

MARCO uses self-distillation on a DINOv2 backbone to turn sparse keypoints into dense semantic correspondences that generalize to unseen points and categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a compact model built on DINOv2 can reach or exceed the accuracy of much larger diffusion-based systems for semantic correspondence by adding a coarse-to-fine objective and a self-distillation step. This matters because existing high-performing methods still fail when queried points or object categories lie outside the training distribution, which is the common case in real applications. The self-distillation converts a small number of annotated keypoints into dense, semantically consistent correspondence fields without increasing model size or runtime. Experiments demonstrate that the accuracy improvements grow larger at strict fine-grained thresholds and on benchmarks that test unseen keypoints or categories.

Core claim

Building upon DINOv2, MARCO introduces a unified model for generalizable correspondence driven by a novel training framework. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework that expands sparse supervision beyond annotated regions, the approach transforms a handful of keypoints into dense, semantically coherent correspondences while remaining 3x smaller and 10x faster than diffusion-based alternatives.

What carries the argument

The self-distillation framework that uses the model's own predictions to expand sparse keypoint annotations into dense, semantically coherent correspondence maps, combined with a coarse-to-fine refinement objective.
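The pattern this mechanism rests on, pseudo-labels mined from a teacher's own feature matches and kept only when cycle-consistent and confident, can be sketched in a few lines. This is an editorial illustration of the general recipe (mutual nearest neighbours plus an EMA teacher, both named in Figure 3), not MARCO's released implementation; the function names and the threshold `tau` are hypothetical.

```python
import numpy as np

def ema_update(teacher, student, m=0.999):
    # Teacher weights track an exponential moving average of the
    # student, so pseudo-labels come from a slowly evolving model.
    for k in teacher:
        teacher[k] = m * teacher[k] + (1.0 - m) * student[k]
    return teacher

def mine_pseudo_matches(f_src, f_tgt, tau=0.8):
    # f_src: (N, D), f_tgt: (M, D) L2-normalized teacher features.
    # Keep mutual nearest neighbours whose cosine similarity clears a
    # confidence threshold; these densify the sparse keypoint labels.
    sim = f_src @ f_tgt.T                  # (N, M) cosine similarities
    fwd = sim.argmax(axis=1)               # source -> target matches
    bwd = sim.argmax(axis=0)               # target -> source matches
    idx = np.arange(f_src.shape[0])
    mutual = bwd[fwd] == idx               # cycle-consistent pairs
    confident = sim[idx, fwd] > tau
    keep = mutual & confident
    return idx[keep], fwd[keep]            # paired (src, tgt) indices
```

The referee's confirmation-bias worry lives exactly in the `tau` gate: if the teacher is systematically wrong in some region, confident wrong matches pass the filter and get distilled back into the student.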

If this is right

  • New state-of-the-art results on SPair-71k, AP-10K, and PF-PASCAL, with the largest gains at fine-grained thresholds such as PCK@0.01.
  • Strongest reported generalization to unseen keypoints, reaching +5.1 on SPair-U.
  • Strongest reported generalization to unseen categories, reaching +4.7 on MP-100.
  • Model remains 3 times smaller and runs 10 times faster than competing diffusion-based approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-distillation may serve as a general technique to densify supervision in other sparse-label vision tasks without collecting additional annotations.
  • The results suggest that carefully designed training objectives on DINOv2 features can replace the need for explicit diffusion backbones in correspondence problems.
  • The same densification process could be tested on video sequences or 3D point clouds to produce temporally or spatially consistent matches.
  • The reduced size and speed make accurate semantic correspondence feasible for real-time applications on edge devices.

Load-bearing premise

The self-distillation step expands sparse keypoint supervision into dense correspondences without introducing systematic errors or confirmation bias from the teacher predictions.

What would settle it

Train the model on SPair-71k and evaluate PCK@0.01 and generalization metrics on a dataset whose keypoints and object categories are completely disjoint from all training data; if the reported gains over prior methods disappear, the central claim does not hold.
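The settling experiment hinges on PCK@α, which the paper (Figure 5 caption) defines as the fraction of predictions landing within α·max(h, w) of the ground-truth keypoint, where h and w are the object bounding-box dimensions. A minimal sketch of the metric, with illustrative names:

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    # Percentage of Correct Keypoints: a prediction is correct if it
    # lies within alpha * max(h, w) of the ground-truth keypoint,
    # where (h, w) are the object bounding-box dimensions.
    h, w = bbox_hw
    thresh = alpha * max(h, w)
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float((dists <= thresh).mean())
```

At α = 0.01 the tolerance on a 100-pixel box is a single pixel, which is why the claimed gains concentrate at that threshold: it is where coarse heatmap peaks stop being good enough.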

Figures

Figures reproduced from arXiv: 2604.18267 by Carlo Masone, Claudia Cuttano, Gabriele Trivigno, Stefan Roth.

Figure 1
Figure 1: MARCO, a model for generalizable correspondences. Built on DINOv2, MARCO explores the unseen space of semantic correspondence by inferring structure that lies beyond the sparsity of keypoint annotations. During training, it discovers reliable matches across instances and propagates them smoothly across the object surface, transforming limited keypoint supervision into a dense training signal. Compared to p… view at source ↗
Figure 2
Figure 2: Flow consistency in DINOv2. Semantic flow (in HSV space) from raw feature matches between two objects. Fine-tuning on sparse keypoints improves only the landmarks’ representation, reducing geometric coherence (b). Our self-supervised objective produces smooth, object-consistent flow across the surface (c). view at source ↗
Figure 3
Figure 3: Overview of MARCO. We insert lightweight adapters into DINOv2 and add a compact upsampling layer (red). At training time, we propose a coarse-to-fine Gaussian RBF loss that progressively sharpens peaks on annotated keypoints and a self-distillation objective that exploits the pre-existing structure of DINOv2 features. Given source and target images, we extract features from an EMA teacher and identify reli… view at source ↗
Figure 4
Figure 4: Correspondence mining via flow anchoring. Given a source and target image, the dense displacement field is estimated from sparse matches using piece-wise affine motion on a Delaunay triangulation. We then identify regions of coherent motion via k-means clustering in displacement space. Due to model inaccuracies, occlusion, or object symmetries, coherent motion may lead to incorrect matches: we anchor clu… view at source ↗
Figure 5
Figure 5: MP-100 macro-domains at a glance. Following [28, 32, 53, 54], the paper uses the Percentage of Correct Keypoints (PCK@α) as metric: a prediction is correct if it lies within a distance of α·max(h, w) from the ground-truth keypoint, where h and w denote the object bounding-box dimensions; per-image PCK is averaged over the test set. view at source ↗
Figure 6
Figure 6: MP-100 samples used in our benchmark. Each column shows representative instances from the five macro-domains. view at source ↗
Figure 7
Figure 7: Qualitative results with MARCO. Examples from SPair-71k [34] and MP-100 [48]. view at source ↗
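The correspondence-mining procedure described in the Figure 4 caption has two mechanical steps: a piecewise-affine displacement field interpolated over a Delaunay triangulation of the sparse matches, then k-means clustering in displacement space to find coherently moving regions. A hedged sketch of those two steps, with illustrative function names and SciPy standing in for whatever the authors actually use; the anchoring step that rejects wrong clusters is omitted:

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator
from scipy.cluster.vq import kmeans2

def dense_flow_from_matches(src_pts, tgt_pts, h, w):
    # Piecewise-affine displacement field: linear interpolation of the
    # sparse displacements over a Delaunay triangulation of the source
    # keypoints (affine within each triangle, NaN outside the hull).
    disp = tgt_pts - src_pts
    interp = LinearNDInterpolator(Delaunay(src_pts), disp)
    ys, xs = np.mgrid[0:h, 0:w]
    queries = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    return interp(queries).reshape(h, w, 2)

def coherent_motion_regions(flow, k=3):
    # Group pixels into regions of coherent motion via k-means in
    # displacement space; pixels outside the hull stay labelled -1.
    valid = ~np.isnan(flow[..., 0])
    labels = np.full(flow.shape[:2], -1)
    _, assign = kmeans2(flow[valid], k, minit="points")
    labels[valid] = assign
    return labels
```

The caption's point about symmetries is visible here: a coherent cluster in displacement space can still map the left side of an object to the right, which is why the authors anchor clusters to reliable matches rather than trusting coherence alone.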
read the original abstract

Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK@0.01), strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MARCO, a DINOv2-based unified model for semantic correspondence. It proposes a coarse-to-fine objective coupled with a self-distillation framework that expands sparse keypoint annotations into dense, semantically coherent pseudo-labels. The work claims new state-of-the-art results on SPair-71k, AP-10K, and PF-PASCAL, with amplified gains at fine-grained thresholds (+8.9 PCK@0.01), improved generalization to unseen keypoints (+5.1 PCK on SPair-U) and categories (+4.7 on MP-100), and efficiency benefits (3x smaller and 10x faster than diffusion-based approaches). Code is released.

Significance. If the central claims hold after validation, the work would represent a meaningful advance in semantic correspondence by addressing poor generalization beyond training keypoints, a key limitation for real-world use. The efficiency gains relative to large diffusion models could increase accessibility, and the explicit code release at https://github.com/visinf/MARCO is a clear strength supporting reproducibility.

major comments (2)
  1. The self-distillation framework (described as expanding sparse supervision beyond annotated regions via coupling to the coarse-to-fine objective) is load-bearing for the generalization results (+5.1 PCK on SPair-U unseen keypoints and +4.7 on MP-100 unseen categories). However, the manuscript provides no independent quantitative measurement of pseudo-label fidelity or confirmation bias outside the original sparse set, leaving open the possibility that teacher (DINOv2) errors are propagated rather than true semantic generalization being achieved.
  2. The experimental results section reports SOTA gains including +8.9 PCK@0.01 without error bars, ablation studies isolating the self-distillation component, or full details on baseline implementations and statistical significance testing. This undermines assessment of whether the fine-grained and out-of-distribution improvements are robust or could be inflated.
minor comments (2)
  1. The abstract states efficiency claims (3x smaller, 10x faster) but does not specify the exact parameter counts or runtime measurements used for comparison with diffusion-based approaches.
  2. Ensure that all tables and figures in the full manuscript include clear captions, units, and any necessary statistical details for the PCK metrics.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to strengthen the presentation and validation of our results.

read point-by-point responses
  1. Referee: The self-distillation framework (described as expanding sparse supervision beyond annotated regions via coupling to the coarse-to-fine objective) is load-bearing for the generalization results (+5.1 PCK on SPair-U unseen keypoints and +4.7 on MP-100 unseen categories). However, the manuscript provides no independent quantitative measurement of pseudo-label fidelity or confirmation bias outside the original sparse set, leaving open the possibility that teacher (DINOv2) errors are propagated rather than true semantic generalization being achieved.

    Authors: We agree that direct quantitative assessment of pseudo-label fidelity would provide stronger support for the generalization claims. The performance improvements on unseen keypoints and categories are larger than those achieved by the DINOv2 teacher alone, which suggests the self-distillation process contributes to genuine semantic expansion rather than simple error propagation. Nevertheless, to address this concern explicitly, we will add a new analysis subsection that measures pseudo-label quality against available ground-truth annotations on held-out regions and includes controlled ablations to quantify any confirmation bias. revision: yes

  2. Referee: The experimental results section reports SOTA gains including +8.9 PCK@0.01 without error bars, ablation studies isolating the self-distillation component, or full details on baseline implementations and statistical significance testing. This undermines assessment of whether the fine-grained and out-of-distribution improvements are robust or could be inflated.

    Authors: We acknowledge that reporting error bars, isolating the self-distillation contribution, providing fuller baseline implementation details, and including statistical significance tests would improve the robustness assessment. The current manuscript contains ablations on the overall framework and coarse-to-fine objective, but we will expand the experimental section in revision to report standard deviations over multiple random seeds, add a dedicated ablation table isolating self-distillation, include precise reproduction details for all baselines, and apply appropriate statistical tests (e.g., paired t-tests) to the reported gains at fine thresholds and on out-of-distribution splits. revision: yes
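The statistical reporting promised in this response, standard deviations over seeds plus a paired test, is mechanically simple; a sketch with hypothetical per-seed PCK scores (the function name and numbers are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

def compare_runs(pck_ours, pck_baseline):
    # Paired t-test over matched runs (same seeds, same test split),
    # plus mean +/- std for error bars in the results table.
    a = np.asarray(pck_ours, dtype=float)
    b = np.asarray(pck_baseline, dtype=float)
    t, p = stats.ttest_rel(a, b)
    return {
        "ours": (a.mean(), a.std(ddof=1)),
        "baseline": (b.mean(), b.std(ddof=1)),
        "t": float(t), "p": float(p),
    }
```

Pairing matters here: seed-to-seed variance is shared between methods trained on the same splits, so the paired test is considerably more sensitive than comparing two independent means.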

Circularity Check

0 steps flagged

No circularity: empirical training framework with independent benchmark validation

full rationale

The paper presents MARCO as an empirical method that augments DINOv2 via a self-distillation framework and coarse-to-fine objective to expand sparse keypoint supervision into dense correspondences. No equations, derivations, or self-citations are shown that reduce the reported PCK gains, generalization metrics, or efficiency claims to quantities defined by the inputs or by construction. The core claims rest on released code and external benchmark results rather than tautological redefinitions or fitted-parameter predictions. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that DINOv2 features already encode sufficient semantic information and that self-generated pseudo-labels from the model itself remain reliable when supervision is sparse.

axioms (1)
  • domain assumption DINOv2 provides a strong semantic feature backbone for correspondence tasks
    The entire model is built by extending DINOv2 rather than learning features from scratch.

pith-pipeline@v0.9.0 · 5511 in / 1312 out tokens · 32468 ms · 2026-05-10T05:40:06.476590+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

171 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. In ECCV Workshop on What is Motion For?, 2022.
  2. [2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
  3. [3] Chuangrong Chen, Xiaozhi Chen, and Hui Cheng. On the over-smoothing problem of CNN based disparity estimation. In CVPR, pages 8997–9005, 2019.
  4. [4] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. AdaptFormer: Adapting vision transformers for scalable visual recognition. In NeurIPS, volume 35, pages 16664–16678, 2022.
  5. [5] Seokju Cho, Sunghwan Hong, and Seungryong Kim. CATs++: Boosting cost aggregation with convolutions and transformers. IEEE Trans. Pattern Anal. Mach. Intell., 45(6):7174–7194, 2023.
  6. [6] Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, and Carlo Masone. SANSA: Unleashing the hidden semantics in SAM2 for few-shot segmentation. In NeurIPS, volume 38, 2025.
  7. [7] Olaf Dünkel, Thomas Wimmer, Christian Theobalt, Christian Rupprecht, and Adam Kortylewski. Do it yourself: Learning semantic correspondence from pseudo-labels. In ICCV, pages 5834–5844, 2025.
  8. [8] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust dense feature matching. In CVPR, pages 19790–19800, 2024.
  9. [9] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3D awareness of visual foundation models. In CVPR, pages 21795–21806, 2024.
  10. [10] Frank Fundel, Johannes Schusterbauer, Vincent Tao Hu, and Björn Ommer. Distillation of diffusion features for semantic correspondence. In WACV, pages 6762–6774, 2025.
  11. [11] Kerui Gu, Linlin Yang, Michael Bi Mi, and Angela Yao. Bias-compensated integral regression for human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell., 45(9):10687–10702, 2023.
  12. [12] Yoav HaCohen, Eli Shechtman, Dan B. Goldman, and Dani Lischinski. Non-rigid dense correspondence with applications for image enhancement. In SIGGRAPH, 2011.
  13. [13] Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce. Proposal Flow: Semantic correspondences from object proposals. IEEE Trans. Pattern Anal. Mach. Intell., 40(7):1711–1725, 2017.
  14. [14] Regine Hartwig, Dominik Muhle, Riccardo Marin, and Daniel Cremers. GECO: Geometrically consistent embedding with lightspeed inference. In ICCV, pages 9309–9319, 2025.
  15. [15] Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, and Seungryong Kim. Cost aggregation with 4D convolutional Swin transformer for few-shot segmentation. In ECCV, volume 29, pages 108–126, 2022a.
  16. [16] Sunghwan Hong, Jisu Nam, Seokju Cho, Susung Hong, Sangryul Jeon, Dongbo Min, and Seungryong Kim. Neural matching fields: Implicit representation of matching fields for visual correspondence. In NeurIPS, volume 35, pages 13512–13526, 2022b.
  17. [17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  18. [18] Shuaiyi Huang, Luyu Yang, Bo He, Songyang Zhang, Xuming He, and Abhinav Shrivastava. Learning semantic correspondence with sparse annotations. In ECCV, volume 14, pages 267–284, 2022.
  19. [19] Zhenyu Jiang, Hanwen Jiang, and Yuke Zhu. Doduo: Dense visual correspondence from unsupervised semantic-aware flow. In ICRA, pages 12420–12427, 2024.
  20. [20] Seungryong Kim, Dongbo Min, Somi Jeong, Sunok Kim, Sangryul Jeon, and Kwanghoon Sohn. Semantic attribute matching networks. In CVPR, pages 12339–12348, 2019.
  21. [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, pages 1–13, 2015.
  22. [22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment Anything. In ICCV, pages 4015–4026, 2023.
  23. [23] Zihang Lai, Senthil Purushwalkam, and Abhinav Gupta. The functional correspondence problem. In CVPR, pages 15772–15781, 2021.
  24. [24] Junghyup Lee, Dohyung Kim, Jean Ponce, and Bumsub Ham. SFNet: Learning object-aware semantic correspondence. In CVPR, pages 2278–2287, 2019.
  25. [25] Shuda Li, Kai Han, Theo W. Costain, Henry Howard-Jenkins, and Victor Prisacariu. Correspondence networks with adaptive neighbourhood consensus. In CVPR, pages 10196–10205, 2020.
  26. [26] Xin Li, Deng-Ping Fan, Fan Yang, Ao Luo, Hong Cheng, and Zicheng Liu. Probabilistic model distillation for semantic correspondence. In CVPR, pages 7505–7514, 2021.
  27. [27] Xinghui Li, Kai Han, Xingchen Wan, and Victor Adrian Prisacariu. SimSC: A simple framework for semantic correspondence with temperature learning. arXiv:2305.02385 [cs.CV], 2023.
  28. [28] Xinghui Li, Jingyi Lu, Kai Han, and Victor Adrian Prisacariu. SD4Match: Learning to prompt Stable Diffusion model for semantic matching. In CVPR, pages 27558–27568, 2024.
  29. [29] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):978–994, 2011.
  30. [30] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion Hyperfeatures: Searching through time and space for semantic correspondence. In NeurIPS, volume 36, pages 47500–47510, 2023.
  31. [31] Octave Mariotti, Oisin Mac Aodha, and Hakan Bilen. Improving semantic correspondence with viewpoint-guided spherical maps. In CVPR, pages 19521–19530, 2024.
  32. [32] Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, and Hakan Bilen. Jamais Vu: Exposing the generalization gap in supervised semantic correspondence. In NeurIPS, volume 38, 2025.
  33. [33] Juhong Min and Minsu Cho. Convolutional Hough matching networks. In CVPR, pages 2940–2950, 2021.
  34. [34] Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. SPair-71k: A large-scale benchmark for semantic correspondence. arXiv:1908.10543 [cs.CV], 2019.
  35. [35] Khoi Duc Nguyen, Chen Li, and Gim Hee Lee. ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation. In CVPR, pages 23491–23500, 2024.
  36. [36] Dolev Ofri-Amar, Michal Geyer, Yoni Kasten, and Tali Dekel. Neural congealing: Aligning images to a joint semantic atlas. In CVPR, pages 19403–19412, 2023.
  37. [37] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res., 2024.
  38. [38] Baiyu Pan, Bowen Yao, Jianxin Pang, Jun Cheng, et al. The sampling-Gaussian for stereo matching. In ICLR, 2025.
  39. [39] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Neighbourhood consensus networks. In NeurIPS, volume 31, pages 1658–1669, 2018.
  40. [40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  41. [41] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In CVPR, pages 8922–8931, 2021.
  42. [42] Yixuan Sun, Zhangyue Yin, Haibo Wang, Yan Wang, Xipeng Qiu, Weifeng Ge, and Wenqiang Zhang. Pixel-level semantic correspondence through layout-aware representation learning and multi-scale matching integration. In CVPR, pages 17047–17056, 2024.
  43. [43] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In NeurIPS, volume 36, pages 1363–1389, 2023.
  44. [44] Prune Truong, Martin Danelljan, and Radu Timofte. GLU-Net: Global-local universal network for dense flow and correspondences. In CVPR, pages 6258–6268, 2020.
  45. [45] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing ViT features for semantic appearance transfer. In CVPR, pages 10748–10757, 2022.
  46. [46] Narek Tumanyan, Assaf Singer, Shai Bagon, and Tali Dekel. DINO-Tracker: Taming DINO for self-supervised point tracking in a single video. In ECCV, volume 26, pages 367–385, 2024.
  47. [47] Jiangshan Wang, Yue Ma, Jiayi Guo, Yicheng Xiao, Gao Huang, and Xiu Li. COVE: Unleashing the Diffusion Feature correspondence for consistent video editing. In NeurIPS, volume 37, pages 96541–96565, 2024.
  48. [48] Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, and Xiaogang Wang. Pose for everything: Towards category-agnostic pose estimation. In ECCV, volume 6, pages 398–416, 2022.
  49. [49] Fei Xue, Sven Elflein, Laura Leal-Taixé, and Qunjie Zhou. MATCHA: Towards matching anything. In CVPR, pages 27081–27091, 2025.
  50. [50] Chiao-An Yang and Raymond A. Yeh. Heatmap regression without soft-argmax for facial landmark detection. In CVPR, pages 28729–28739, 2025.
  51. [51] Hang Yu, Yufei Xu, Jing Zhang, Wei Zhao, Ziyu Guan, and Dacheng Tao. AP-10K: A benchmark for animal pose estimation in the wild. In NeurIPS Datasets and Benchmarks, 2021.
  52. [52] Feng Zhang, Xiatian Zhu, Hanbin Dai, Mao Ye, and Ce Zhu. Distribution-aware coordinate representation for human pose estimation. In CVPR, pages 7091–7100, 2020.
  53. [53] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A Tale of two features: Stable Diffusion complements DINO for zero-shot semantic correspondence. In NeurIPS, volume 36, pages 45533–45547, 2023.
  54. [54] Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. Telling left from right: Identifying geometry-aware semantic correspondence. In CVPR, pages 3076–3085, 2024.

Showing first 54 references.