arxiv: 2511.02830 · v2 · submitted 2025-11-04 · 💻 cs.CV

DenseMarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky

Pith reviewed 2026-05-18 00:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords canonical embeddings · dense correspondences · human heads · point tracks · vision transformer · face tracking · contrastive loss · 3D morphable models


The pith

DenseMarks learns pixel-wise 3D embeddings in a shared canonical cube for human heads by supervising a Vision Transformer with dense point tracks from videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to produce dense, consistent correspondences for human head images by mapping every pixel to a location inside a fixed 3D unit cube. Training relies on contrastive loss applied to point matches automatically extracted by a tracker across many in-the-wild talking-head videos, augmented by landmark detection, segmentation, and continuity constraints inside the cube. Because the supervision is dense and the bottleneck is a single canonical space, the resulting embeddings stay aligned even when heads move or belong to different people and extend to hair regions. A reader would care because this yields a queryable representation usable for tracking, semantic part matching, and stereo tasks without requiring manual 3D labels or per-person fitting.

Core claim

We propose DenseMarks, a new learned representation for human heads where a Vision Transformer network predicts a 3D embedding for each pixel that corresponds to a location in a 3D canonical unit cube. Training collects pairwise point matches estimated by a state-of-the-art point tracker over diverse in-the-wild talking heads videos and guides the mapping via a contrastive loss that encourages matched points to have close embeddings. Multi-task learning with face landmarks and segmentation constraints together with spatial continuity imposed through latent cube features produces an interpretable and queryable canonical space that remains consistent across poses and individuals.
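The step of collecting pairwise point matches from tracker output can be illustrated with a minimal sketch. It assumes the tracker returns, per frame, a pixel position and a visibility flag for each tracked point; the function name `pairwise_matches`, the `max_gap` parameter, and the data layout are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_matches(tracks, visible, max_gap=None):
    """Turn point tracks into cross-frame point matches.

    tracks[t][n]  -> (x, y) pixel position of point n in frame t
    visible[t][n] -> True if point n is reported visible in frame t (assumed format)
    max_gap       -> optional cap on the frame distance between the two views
    Returns a list of (frame_a, frame_b, xy_a, xy_b) tuples.
    """
    matches = []
    for a, b in combinations(range(len(tracks)), 2):
        if max_gap is not None and b - a > max_gap:
            continue
        for n in range(len(tracks[a])):
            # Keep a pair only when the same point is visible in both frames.
            if visible[a][n] and visible[b][n]:
                matches.append((a, b, tracks[a][n], tracks[b][n]))
    return matches


# Toy example: 3 frames, 2 tracked points; point 1 is occluded in frame 1.
tracks = [
    [(10.0, 20.0), (30.0, 40.0)],
    [(11.0, 21.0), (31.0, 41.0)],
    [(12.0, 22.0), (32.0, 42.0)],
]
visible = [
    [True, True],
    [True, False],
    [True, True],
]
matches = pairwise_matches(tracks, visible)
print(len(matches))
```

Each returned tuple is one positive pair for a contrastive objective. How the authors sample frame pairs or filter unreliable tracks is not stated on this page.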

What carries the argument

A Vision Transformer that outputs a 3D embedding vector for every pixel, constrained to lie inside a shared canonical unit cube and trained with contrastive loss on automatically tracked point pairs.

If this is right

Matched points across different head images can be found by nearest-neighbor lookup in the canonical cube.
Monocular head tracking can be performed by fitting a 3D Morphable Model to the predicted embeddings.
Stereo reconstruction benefits from the dense geometry-aware correspondences produced by the embedding.
The representation remains stable under large pose changes and includes hair as well as facial skin.
Embeddings stay aligned for the same semantic location even when images come from different individuals.
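The nearest-neighbor lookup in the first point can be sketched as a brute-force search. This is an illustration under stated assumptions, not the authors' implementation: embeddings are stored as nested lists of `(u, v, w)` cube coordinates, and the names `nearest_pixel` and `mask` are hypothetical.

```python
import math


def nearest_pixel(query, embedding_map, mask=None):
    """Find the pixel whose predicted cube coordinate is closest to `query`.

    query         -> (u, v, w) point in the canonical cube
    embedding_map -> H x W nested list of (u, v, w) per-pixel predictions
    mask          -> optional H x W nested list of bools (True = head pixel)
    Returns ((row, col), distance); (None, inf) if no pixel is eligible.
    """
    best_rc, best_dist = None, math.inf
    for r, row in enumerate(embedding_map):
        for c, emb in enumerate(row):
            if mask is not None and not mask[r][c]:
                continue
            d = math.dist(query, emb)  # Euclidean distance inside the cube
            if d < best_dist:
                best_rc, best_dist = (r, c), d
    return best_rc, best_dist


# Two tiny 2 x 2 "images" with made-up embeddings.
image_a = [[(0.1, 0.1, 0.1), (0.9, 0.1, 0.1)],
           [(0.1, 0.9, 0.1), (0.5, 0.5, 0.5)]]
image_b = [[(0.52, 0.5, 0.5), (0.1, 0.1, 0.12)],
           [(0.9, 0.1, 0.1), (0.1, 0.88, 0.1)]]

# Query image_b with the embedding of pixel (1, 1) from image_a.
match, dist = nearest_pixel(image_a[1][1], image_b)
print(match, round(dist, 3))
```

On real images a vectorized or tree-based search would replace the double loop; the loop is kept here only to make the lookup explicit.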

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The canonical cube could be queried to transfer texture or expression between different heads without an explicit 3D mesh.
Similar track-based supervision might extend the same idea to full-body or hand surfaces if comparable video data exists.
Interactive applications could let users click a point in one view and instantly locate its counterpart in another view via the shared cube coordinates.

Load-bearing premise

Pairwise point matches produced by an off-the-shelf tracker on in-the-wild videos are accurate and dense enough to serve as reliable supervision for learning a consistent canonical embedding.

What would settle it

On a test collection of head images with independently verified dense correspondences, the learned embeddings for truly corresponding points fail to lie close together inside the canonical cube.
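A check of this kind could be scored as follows. The sketch assumes the verified correspondences are given as two aligned lists of predicted cube coordinates; the function name `correspondence_report` and the 0.05 threshold are hypothetical choices for illustration, not values from the paper.

```python
import math


def correspondence_report(emb_a, emb_b, threshold=0.05):
    """Compare cube embeddings of ground-truth corresponding points.

    emb_a[i] and emb_b[i] are the (u, v, w) predictions for the same physical
    point in two different images.
    Returns (mean_distance, fraction_of_pairs_within_threshold).
    """
    if not emb_a or len(emb_a) != len(emb_b):
        raise ValueError("need two non-empty lists of equal length")
    dists = [math.dist(a, b) for a, b in zip(emb_a, emb_b)]
    mean_distance = sum(dists) / len(dists)
    within = sum(d <= threshold for d in dists) / len(dists)
    return mean_distance, within


# Made-up predictions: the first pair agrees closely, the second does not.
emb_a = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2)]
emb_b = [(0.5, 0.5, 0.53), (0.2, 0.6, 0.2)]
mean_distance, within = correspondence_report(emb_a, emb_b)
print(round(mean_distance, 3), within)
```

A large mean distance or a low within-threshold fraction on independently verified pairs would count against the central claim.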

Figures

Figures reproduced from arXiv: 2511.02830 by Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky, Dmitrii Pozdeev.

**Figure 2.** To learn our representation, we train an embedder network …

**Figure 3.** Point querying. We select a specific point on a few images and find the reference embed…

**Figure 4.** Semantic regions on head images can be located via selecting corresponding volumetric …

**Figure 5.** Dense warping. Here, we copy pixels from source to target based on the target …

**Figure 6.** Monocular tracking. We evaluate our method on downstream application of applying …

**Figure 7.** Stereo Reconstruction. We triangulate 2-view and 3-view correspondences of our representations using known camera parameters in Nersemble (Kirschstein et al., 2023).

**Figure 8.** Removing the landmark or segmentation loss makes region finding much less reliable. Blue: forehead center, green and orange: ears, yellow: skin near the left eyebrow corner. Losses ablation. Even though the network can learn without introduced constraints on landmark locations in the cube and segmentation loss, we demonstrate that the finding characteristic points and regions becomes more problematic i…

**Figure 9.** Dense warping. Pixels are copied from the source to the target based on a target …

**Figure 10.** Blue: forehead center, green and orange: ears, yellow: skin near the left eyebrow corner.

**Figure 11.** For each pose (colored images on the left), we visualize the surface corresponding to …

**Figure 12.** Predicted canonical coordinates are robust to lighting changes. Ours: result of …

**Figure 13.** Stereo reconstruction results for two, three (narrow/wide baseline), and five views.

Original abstract

We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper learns dense canonical 3D embeddings for heads by contrasting real-video point tracks from an off-the-shelf tracker, plus landmark and segmentation losses, and it looks workable for correspondence tasks if the tracks hold up.


The core contribution is training a ViT to output 3D coordinates inside a unit cube for every pixel in a head image. Supervision comes from a contrastive loss on pairwise matches produced by a state-of-the-art point tracker run on in-the-wild talking-head videos, with extra landmark, segmentation, and latent-cube continuity terms to keep things consistent. They report state-of-the-art numbers on geometry-aware matching and monocular tracking with 3DMMs, and they plan to release code and checkpoints.

That combination of real-video tracks with the canonical-cube bottleneck is the part that feels fresh; most prior dense head work leans more on synthetic data or sparse landmarks. The approach does give a queryable space that covers hair and stays stable across pose and identity, which is practically useful for tracking and stereo tasks.

The main soft spot is exactly the one the stress test flags: everything rests on the tracker producing accurate, dense matches even in low-texture hair regions or under blur and occlusion. The auxiliary losses and continuity constraint help, but they cannot fully rescue a noisy primary signal. The paper does not appear to include a direct ablation on tracker error or a breakdown of failure cases on hair, so it is still unclear how much the claimed robustness comes from the data versus the model.

This is the kind of work that belongs in a computer-vision reading group focused on dense correspondence or head modeling. A reader already working on tracking or reconstruction would get immediate value from the model, once it is released, and from the canonical-space idea. It is worth a serious referee: the method is concrete, the experiments are on real data, and the open questions about supervision quality are the sort that peer review can sharpen.

Referee Report

2 major / 2 minor

Summary. The paper introduces DenseMarks, a ViT-based model that maps pixels from 2D human head images to 3D embeddings inside a canonical unit cube. Training uses contrastive loss on dense pairwise point matches produced by an off-the-shelf tracker applied to in-the-wild talking-head videos, augmented by multi-task losses on landmarks and segmentation plus a continuity regularizer on the latent cube. The resulting embedding is claimed to be pose- and identity-invariant, to cover the full head including hair, and to enable applications such as semantic part correspondence, head tracking, and stereo reconstruction, with reported state-of-the-art performance on geometry-aware point matching and monocular 3DMM tracking.

Significance. If the central claims hold, the work supplies a practical, queryable canonical representation for dense head correspondences that could improve downstream tasks in face analysis and reconstruction. The explicit use of an external tracker for supervision and the planned release of code and checkpoints are positive contributions that facilitate reproducibility.

major comments (2)

[§3.2] (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.
[§4] (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.

minor comments (2)

[Abstract] The abstract and introduction repeatedly use the phrase 'strong supervision' without defining its quantitative strength relative to the tracker error rate.
[Figures] Figure captions for the qualitative results should include the specific video sequences or identities shown to allow readers to assess coverage of hair and extreme poses.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of DenseMarks. We address the two major comments below and will incorporate revisions to strengthen the manuscript.


Referee: [§3.2] (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.

Authors: We agree that an explicit quantitative assessment of tracker accuracy on our specific video collection would better substantiate the supervision quality. The tracker we employ is a recent state-of-the-art method whose robustness has been demonstrated on similar in-the-wild data, yet we did not report per-region error statistics for the collected set. In the revised manuscript we will add a dedicated analysis subsection that evaluates tracker precision and recall on a held-out video subset, with particular attention to low-texture and occluded areas, thereby providing a noise model that supports the contrastive-loss assumptions.

revision: yes

Referee: [§4] (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.

Authors: We acknowledge that the current experimental section does not isolate the individual loss components. While the overall multi-task formulation is described, the contribution of the contrastive term relative to the landmark and segmentation auxiliaries is not quantified through controlled ablations. In the revision we will add a new ablation table that reports performance when the contrastive loss is removed or down-weighted, as well as when auxiliary terms are ablated individually, allowing readers to assess the primary role of the point-track supervision in achieving pose/identity invariance and full-head coverage.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external tracker matches and architectural bottleneck without self-referential reduction

full rationale

The paper trains a ViT to predict per-pixel 3D embeddings inside a fixed canonical unit cube, supervised primarily by contrastive loss on pairwise matches produced by an off-the-shelf point tracker on external video data, plus auxiliary landmark, segmentation, and latent-cube continuity terms. No equation or claim reduces the output embeddings, their pose/identity invariance, or their utility for tracking/reconstruction to a quantity defined by the model's own fitted parameters or by a self-citation chain; the canonical-cube constraint is an explicit architectural choice that enforces consistency by construction of the representation space rather than by tautological re-use of the learned outputs. The central claims therefore rest on the independent quality of the external supervision signal and the multi-task regularizers, which are falsifiable outside the fitted model.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The method depends on the reliability of an external point tracker for supervision and on the assumption that a shared 3D cube plus contrastive loss will produce pose-consistent embeddings without additional geometric constraints.

free parameters (1)

embedding dimension (3D cube size)
The choice of a unit cube and the precise dimensionality of the latent features inside it are design decisions that affect the learned space.
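To make these design decisions concrete, the sketch below samples a feature vector from a regular grid spanning the unit cube by trilinear interpolation, the lookup named in the paper passage quoted in the Lean section of this page. The grid resolution and channel count are exactly the free choices discussed here. The function name `trilinear_sample` and the nested-list layout are hypothetical, the grid is assumed to have the same resolution on every axis, and the Gaussian smoothing of the raw grid mentioned in that passage is omitted.

```python
def trilinear_sample(grid, point):
    """Sample a feature vector from a regular grid over the unit cube.

    grid[i][j][k] -> list of C floats stored at node (i, j, k); every axis is
                     assumed to have the same number of nodes R >= 2
    point         -> (u, v, w), clamped to [0, 1]^3
    """
    res = len(grid)
    channels = len(grid[0][0][0])
    idx, frac = [], []
    for coord in point:
        pos = min(max(coord, 0.0), 1.0) * (res - 1)
        i0 = min(int(pos), res - 2)  # lower corner of the enclosing cell
        idx.append(i0)
        frac.append(pos - i0)
    out = [0.0] * channels
    # Blend the 8 corners of the cell, weighted by proximity along each axis.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((frac[0] if di else 1.0 - frac[0])
                          * (frac[1] if dj else 1.0 - frac[1])
                          * (frac[2] if dk else 1.0 - frac[2]))
                node = grid[idx[0] + di][idx[1] + dj][idx[2] + dk]
                for ch in range(channels):
                    out[ch] += weight * node[ch]
    return out


# 2 x 2 x 2 grid with one channel whose value is i + j + k at each node.
grid = [[[[float(i + j + k)] for k in range(2)] for j in range(2)] for i in range(2)]
center = trilinear_sample(grid, (0.5, 0.5, 0.5))
print(center)
```

Because neighboring query points blend the same corner features, nearby cube locations receive similar features, which is one way such a grid can encourage spatial continuity.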

axioms (2)

domain assumption: Point tracks produced by a state-of-the-art tracker on in-the-wild videos provide sufficiently accurate and dense pairwise matches for contrastive supervision.
The dataset of pairwise point matches is collected directly from this tracker and used to drive the contrastive loss.
domain assumption: Adding landmark and segmentation losses plus latent-cube feature sharing will enforce spatial continuity and semantic consistency without introducing new inconsistencies.
These multi-task and continuity terms are presented as sufficient to make the canonical space interpretable and queryable.

invented entities (1)

3D canonical unit cube (no independent evidence)
purpose: To serve as a shared, queryable coordinate space that maps every pixel to a consistent 3D location across poses and individuals.
The Vision Transformer is trained to output a 3D embedding that corresponds to a location inside this cube.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we decide to represent our canonical space as a unit cube in 3D ... latent feature grid E = gaussian filter 3D(Eraw, σ) ... trilinear interpolation

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

contrastive loss ... Lcontr = ||norm(Feat1) norm(Feat1)^T - I||_F
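The quoted expression can be evaluated with a short sketch. Two things are assumptions here rather than facts from the paper: `norm` is read as row-wise L2 normalization, and the function accepts two feature sets so that either the literal reading (the same set on both sides, as quoted) or a two-view reading can be tried. The names `l2_normalize` and `gram_identity_loss` are hypothetical.

```python
import math


def l2_normalize(rows):
    """Scale each feature row to unit Euclidean length (zero rows are left as-is)."""
    out = []
    for row in rows:
        length = math.sqrt(sum(x * x for x in row))
        out.append([x / length for x in row] if length > 0 else list(row))
    return out


def gram_identity_loss(feat_a, feat_b):
    """Frobenius norm of norm(A) norm(B)^T - I for two N x D feature lists."""
    if len(feat_a) != len(feat_b):
        raise ValueError("both feature sets must contain the same number of rows")
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    total = 0.0
    for i in range(len(a)):
        for j in range(len(b)):
            similarity = sum(x * y for x, y in zip(a[i], b[j]))
            target = 1.0 if i == j else 0.0  # identity matrix entry
            total += (similarity - target) ** 2
    return math.sqrt(total)


# Mutually orthogonal rows give zero loss; identical rows are penalized.
orthogonal = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss_orthogonal = gram_identity_loss(orthogonal, orthogonal)
loss_collapsed = gram_identity_loss(collapsed, collapsed)
print(loss_orthogonal, loss_collapsed)
```

Under this reading the diagonal rewards row i of one set for aligning with row i of the other, while the off-diagonal entries penalize similarity between different points.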

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
