Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
Pith reviewed 2026-05-18 00:50 UTC · model grok-4.3
The pith
DenseMarks learns pixel-wise 3D embeddings in a shared canonical cube for human heads by supervising a Vision Transformer with dense point tracks from videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose DenseMarks, a new learned representation for human heads where a Vision Transformer network predicts a 3D embedding for each pixel that corresponds to a location in a 3D canonical unit cube. Training collects pairwise point matches estimated by a state-of-the-art point tracker over diverse in-the-wild talking heads videos and guides the mapping via a contrastive loss that encourages matched points to have close embeddings. Multi-task learning with face landmarks and segmentation constraints together with spatial continuity imposed through latent cube features produces an interpretable and queryable canonical space that remains consistent across poses and individuals.
What carries the argument
A Vision Transformer that outputs a 3D embedding vector for every pixel, constrained to lie inside a shared canonical unit cube and trained with contrastive loss on automatically tracked point pairs.
If this is right
- Matched points across different head images can be found by nearest-neighbor lookup in the canonical cube.
- Monocular head tracking can be performed by fitting a 3D Morphable Model to the predicted embeddings.
- Stereo reconstruction benefits from the dense geometry-aware correspondences produced by the embedding.
- The representation remains stable under large pose changes and includes hair as well as facial skin.
- Embeddings stay aligned for the same semantic location even when images come from different individuals.
Where Pith is reading between the lines
- The canonical cube could be queried to transfer texture or expression between different heads without an explicit 3D mesh.
- Similar track-based supervision might extend the same idea to full-body or hand surfaces if comparable video data exists.
- Interactive applications could let users click a point in one view and instantly locate its counterpart in another view via the shared cube coordinates.
Load-bearing premise
Pairwise point matches produced by an off-the-shelf tracker on in-the-wild videos are accurate and dense enough to serve as reliable supervision for learning a consistent canonical embedding.
What would settle it
On a test collection of head images with independently verified dense correspondences, the learned embeddings for truly corresponding points fail to lie close together inside the canonical cube.
Figures
read the original abstract
We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DenseMarks, a ViT-based model that maps pixels from 2D human head images to 3D embeddings inside a canonical unit cube. Training uses contrastive loss on dense pairwise point matches produced by an off-the-shelf tracker applied to in-the-wild talking-head videos, augmented by multi-task losses on landmarks and segmentation plus a continuity regularizer on the latent cube. The resulting embedding is claimed to be pose- and identity-invariant, to cover the full head including hair, and to enable applications such as semantic part correspondence, head tracking, and stereo reconstruction, with reported state-of-the-art performance on geometry-aware point matching and monocular 3DMM tracking.
Significance. If the central claims hold, the work supplies a practical, queryable canonical representation for dense head correspondences that could improve downstream tasks in face analysis and reconstruction. The explicit use of an external tracker for supervision and the release of code and checkpoints are positive contributions that facilitate reproducibility.
major comments (2)
- [§3.2] §3.2 (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.
- [§4] §4 (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.
minor comments (2)
- [Abstract] The abstract and introduction repeatedly use the phrase 'strong supervision' without defining its quantitative strength relative to the tracker error rate.
- [Figures] Figure captions for the qualitative results should include the specific video sequences or identities shown to allow readers to assess coverage of hair and extreme poses.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of DenseMarks. We address the two major comments below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3.2] §3.2 (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.
Authors: We agree that an explicit quantitative assessment of tracker accuracy on our specific video collection would better substantiate the supervision quality. The tracker we employ is a recent state-of-the-art method whose robustness has been demonstrated on similar in-the-wild data, yet we did not report per-region error statistics for the collected set. In the revised manuscript we will add a dedicated analysis subsection that evaluates tracker precision and recall on a held-out video subset, with particular attention to low-texture and occluded areas, thereby providing a noise model that supports the contrastive-loss assumptions. revision: yes
-
Referee: [§4] §4 (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.
Authors: We acknowledge that the current experimental section does not isolate the individual loss components. While the overall multi-task formulation is described, the contribution of the contrastive term relative to the landmark and segmentation auxiliaries is not quantified through controlled ablations. In the revision we will add a new ablation table that reports performance when the contrastive loss is removed or down-weighted, as well as when auxiliary terms are ablated individually, allowing readers to assess the primary role of the point-track supervision in achieving pose/identity invariance and full-head coverage. revision: yes
Circularity Check
No circularity: derivation uses external tracker matches and architectural bottleneck without self-referential reduction
full rationale
The paper trains a ViT to predict per-pixel 3D embeddings inside a fixed canonical unit cube, supervised primarily by contrastive loss on pairwise matches produced by an off-the-shelf point tracker on external video data, plus auxiliary landmark, segmentation, and latent-cube continuity terms. No equation or claim reduces the output embeddings, their pose/identity invariance, or their utility for tracking/reconstruction to a quantity defined by the model's own fitted parameters or by a self-citation chain; the canonical-cube constraint is an explicit architectural choice that enforces consistency by construction of the representation space rather than by tautological re-use of the learned outputs. The central claims therefore rest on the independent quality of the external supervision signal and the multi-task regularizers, which are falsifiable outside the fitted model.
Axiom & Free-Parameter Ledger
free parameters (1)
- embedding dimension (3D cube size)
axioms (2)
- domain assumption Point tracks produced by a state-of-the-art tracker on in-the-wild videos provide sufficiently accurate and dense pairwise matches for contrastive supervision.
- domain assumption Adding landmark and segmentation losses plus latent-cube feature sharing will enforce spatial continuity and semantic consistency without introducing new inconsistencies.
invented entities (1)
-
3D canonical unit cube
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we decide to represent our canonical space as a unit cube in 3D ... latent feature grid E = gaussian filter 3D(Eraw, σ) ... trilinear interpolation
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
contrastive loss ... Lcontr = ||norm(Feat1)norm(Feat1)^T - I||_F
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ACM Press/Addison-Wesley Publishing Co. ISBN 0201485605. doi: 10.1145/311535.311556. URLhttps://doi.org/10.1145/311535.311556. Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision, pp....
-
[2]
Unsupervised Learning of Robust Spectral Shape Matching
Dongliang Cao, Paul Roetzer, and Florian Bernard. Unsupervised learning of robust spectral shape matching.arXiv preprint arXiv:2304.14419,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disen- tangled motion from dust3r without training.arXiv preprint arXiv:2503.24391, 2025a. Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3r: Long sequence streaming 3d reconstruction.arXiv preprint arXiv:2507.18255, 2025b. Seokju Cho, Jiahui Hu...
-
[4]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[5]
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation.arXiv preprint arXiv:1804.03619,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152,
Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152,
-
[7]
Simon Giebenhain, Tobias Kirschstein, Martin R ¨unz, Lourdes Agapito, and Matthias Nießner. Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction.arXiv preprint arXiv:2505.00615,
-
[8]
Neural head avatars from monocular rgb videos.arXiv preprint arXiv:2112.01554,
11 Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. Neural head avatars from monocular rgb videos.arXiv preprint arXiv:2112.01554,
-
[9]
Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik
Accessed: 2025-09-25. Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 7122–7131,
work page 2025
-
[10]
Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.arXiv preprint arXiv:2410.11831, 2024a. Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to tr...
-
[11]
Learning to track any points from human motion.arXiv preprint arXiv:2507.06233,
In`es Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, and Seungryong Kim. Learning to track any points from human motion.arXiv preprint arXiv:2507.06233,
-
[12]
Laplace-beltrami eigenfunctions towards an algorithm that” understands” geometry
Bruno L´evy. Laplace-beltrami eigenfunctions towards an algorithm that” understands” geometry. In IEEE International Conference on Shape Modeling and Applications 2006 (SMI’06), pp. 13–13. IEEE,
work page 2006
-
[13]
Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017a. URLhttps://doi.org/10.1145/3130800.3130813. Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial sh...
-
[14]
Decoupled Weight Decay Regularization
I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
MediaPipe: A Framework for Building Perception Pipelines
Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines.arXiv preprint arXiv:1906.08172, 2019a. Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chu...
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[16]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Wham: Reconstructing world- grounded humans with accurate 3d motion
Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world- grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2070–2080,
work page 2070
-
[18]
URLhttps://arxiv.org/ abs/2508.10104. Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. InCVPR,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs
Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction.arXiv preprint arXiv:2503.16318,
-
[21]
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023a. Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqi...
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, and Huazhe Xu. Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo.arXiv preprint arXiv:2412.05268,
-
[23]
Even without DINOv3 pretraining, our ap- proach achieves better results than the strongest baseline, highlighting the importance of the canon- ical space. Moreover, without the canonical space, the network embeddings cannot be reliably used for point querying or region localization, as the model may map semantically different regions close together in the...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.