pith. machine review for the scientific record. sign in

arxiv: 2511.02830 · v2 · submitted 2025-11-04 · 💻 cs.CV

Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Pith reviewed 2026-05-18 00:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords canonical embeddingsdense correspondenceshuman headspoint tracksvision transformerface trackingcontrastive loss3D morphable models
0
0 comments X

The pith

DenseMarks learns pixel-wise 3D embeddings in a shared canonical cube for human heads by supervising a Vision Transformer with dense point tracks from videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to produce dense, consistent correspondences for human head images by mapping every pixel to a location inside a fixed 3D unit cube. Training relies on contrastive loss applied to point matches automatically extracted by a tracker across many in-the-wild talking-head videos, augmented by landmark detection, segmentation, and continuity constraints inside the cube. Because the supervision is dense and the bottleneck is a single canonical space, the resulting embeddings stay aligned even when heads move or belong to different people and extend to hair regions. A reader would care because this yields a queryable representation usable for tracking, semantic part matching, and stereo tasks without requiring manual 3D labels or per-person fitting.

Core claim

We propose DenseMarks, a new learned representation for human heads where a Vision Transformer network predicts a 3D embedding for each pixel that corresponds to a location in a 3D canonical unit cube. Training collects pairwise point matches estimated by a state-of-the-art point tracker over diverse in-the-wild talking heads videos and guides the mapping via a contrastive loss that encourages matched points to have close embeddings. Multi-task learning with face landmarks and segmentation constraints together with spatial continuity imposed through latent cube features produces an interpretable and queryable canonical space that remains consistent across poses and individuals.

What carries the argument

A Vision Transformer that outputs a 3D embedding vector for every pixel, constrained to lie inside a shared canonical unit cube and trained with contrastive loss on automatically tracked point pairs.

If this is right

  • Matched points across different head images can be found by nearest-neighbor lookup in the canonical cube.
  • Monocular head tracking can be performed by fitting a 3D Morphable Model to the predicted embeddings.
  • Stereo reconstruction benefits from the dense geometry-aware correspondences produced by the embedding.
  • The representation remains stable under large pose changes and includes hair as well as facial skin.
  • Embeddings stay aligned for the same semantic location even when images come from different individuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The canonical cube could be queried to transfer texture or expression between different heads without an explicit 3D mesh.
  • Similar track-based supervision might extend the same idea to full-body or hand surfaces if comparable video data exists.
  • Interactive applications could let users click a point in one view and instantly locate its counterpart in another view via the shared cube coordinates.

Load-bearing premise

Pairwise point matches produced by an off-the-shelf tracker on in-the-wild videos are accurate and dense enough to serve as reliable supervision for learning a consistent canonical embedding.

What would settle it

On a test collection of head images with independently verified dense correspondences, the learned embeddings for truly corresponding points fail to lie close together inside the canonical cube.

Figures

Figures reproduced from arXiv: 2511.02830 by Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky, Dmitrii Pozdeev.

Figure 1
Figure 1. Figure 1: Our method learns to embed a human head image into a semantics-aware volumetric [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: To learn our representation, we train an embedder network [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Point querying. We select a specific point on a few images and find the reference embed [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Semantic regions on head images can be located via selecting corresponding volumetric [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Dense warping. Here, we copy pixels from source to target based on the target [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Monocular tracking. We evaluate our method on downstream application of applying [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Stereo Reconstruction. We triangulate 2-view and 3-view correspondences of our rep￾resentations using known camera parameters in Nersemble (Kirschstein et al., 2023). Region selection. In [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Removing the landmark or segmenta￾tion loss makes region finding much less reliable. Blue: forehead center, green and orange: ears, yellow: skin near the left eyebrow corner. Losses ablation. Even though the network can learn without introduced constraints on land￾mark locations in the cube and segmentation loss, we demonstrate that the finding charac￾teristic points and regions becomes more prob￾lematic i… view at source ↗
Figure 9
Figure 9. Figure 9: Dense warping. Pixels are copied from the source to the target based on a target [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Blue: forehead center, green and orange: ears, yellow: skin near the left eyebrow corner. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Predicted canonical coordinates are robust to lighting changes. Ours: result of [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: For each pose (colored images on the left), we visualize the surface corresponding to [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Stereo reconstruction results for two, three (narrow/wide baseline), and five views. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
read the original abstract

We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DenseMarks, a ViT-based model that maps pixels from 2D human head images to 3D embeddings inside a canonical unit cube. Training uses contrastive loss on dense pairwise point matches produced by an off-the-shelf tracker applied to in-the-wild talking-head videos, augmented by multi-task losses on landmarks and segmentation plus a continuity regularizer on the latent cube. The resulting embedding is claimed to be pose- and identity-invariant, to cover the full head including hair, and to enable applications such as semantic part correspondence, head tracking, and stereo reconstruction, with reported state-of-the-art performance on geometry-aware point matching and monocular 3DMM tracking.

Significance. If the central claims hold, the work supplies a practical, queryable canonical representation for dense head correspondences that could improve downstream tasks in face analysis and reconstruction. The explicit use of an external tracker for supervision and the release of code and checkpoints are positive contributions that facilitate reproducibility.

major comments (2)
  1. [§3.2] §3.2 (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.
  2. [§4] §4 (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use the phrase 'strong supervision' without defining its quantitative strength relative to the tracker error rate.
  2. [Figures] Figure captions for the qualitative results should include the specific video sequences or identities shown to allow readers to assess coverage of hair and extreme poses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of DenseMarks. We address the two major comments below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (contrastive loss formulation): the claim that the canonical-space bottleneck plus contrastive loss produces embeddings that are robust to pose and identity variations rests on the assumption that the off-the-shelf point tracks are sufficiently accurate and dense; no quantitative error analysis or noise model for the tracker outputs on the collected video set is provided, leaving open the possibility that embedding inconsistencies arise directly from tracker failures in low-texture or occluded regions.

    Authors: We agree that an explicit quantitative assessment of tracker accuracy on our specific video collection would better substantiate the supervision quality. The tracker we employ is a recent state-of-the-art method whose robustness has been demonstrated on similar in-the-wild data, yet we did not report per-region error statistics for the collected set. In the revised manuscript we will add a dedicated analysis subsection that evaluates tracker precision and recall on a held-out video subset, with particular attention to low-texture and occluded areas, thereby providing a noise model that supports the contrastive-loss assumptions. revision: yes

  2. Referee: [§4] §4 (experiments): the reported state-of-the-art results on geometry-aware point matching and monocular head tracking are presented without ablations that isolate the contribution of the contrastive term versus the auxiliary landmark/segmentation losses; this makes it difficult to verify whether the claimed robustness and full-head coverage follow from the primary supervision signal or from the multi-task regularizers.

    Authors: We acknowledge that the current experimental section does not isolate the individual loss components. While the overall multi-task formulation is described, the contribution of the contrastive term relative to the landmark and segmentation auxiliaries is not quantified through controlled ablations. In the revision we will add a new ablation table that reports performance when the contrastive loss is removed or down-weighted, as well as when auxiliary terms are ablated individually, allowing readers to assess the primary role of the point-track supervision in achieving pose/identity invariance and full-head coverage. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external tracker matches and architectural bottleneck without self-referential reduction

full rationale

The paper trains a ViT to predict per-pixel 3D embeddings inside a fixed canonical unit cube, supervised primarily by contrastive loss on pairwise matches produced by an off-the-shelf point tracker on external video data, plus auxiliary landmark, segmentation, and latent-cube continuity terms. No equation or claim reduces the output embeddings, their pose/identity invariance, or their utility for tracking/reconstruction to a quantity defined by the model's own fitted parameters or by a self-citation chain; the canonical-cube constraint is an explicit architectural choice that enforces consistency by construction of the representation space rather than by tautological re-use of the learned outputs. The central claims therefore rest on the independent quality of the external supervision signal and the multi-task regularizers, which are falsifiable outside the fitted model.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The method depends on the reliability of an external point tracker for supervision and on the assumption that a shared 3D cube plus contrastive loss will produce pose-consistent embeddings without additional geometric constraints.

free parameters (1)
  • embedding dimension (3D cube size)
    The choice of a unit cube and the precise dimensionality of the latent features inside it are design decisions that affect the learned space.
axioms (2)
  • domain assumption Point tracks produced by a state-of-the-art tracker on in-the-wild videos provide sufficiently accurate and dense pairwise matches for contrastive supervision.
    The dataset of pairwise point matches is collected directly from this tracker and used to drive the contrastive loss.
  • domain assumption Adding landmark and segmentation losses plus latent-cube feature sharing will enforce spatial continuity and semantic consistency without introducing new inconsistencies.
    These multi-task and continuity terms are presented as sufficient to make the canonical space interpretable and queryable.
invented entities (1)
  • 3D canonical unit cube no independent evidence
    purpose: To serve as a shared, queryable coordinate space that maps every pixel to a consistent 3D location across poses and individuals.
    The Vision Transformer is trained to output a 3D embedding that corresponds to a location inside this cube.

pith-pipeline@v0.9.0 · 5775 in / 1719 out tokens · 45055 ms · 2026-05-18T00:50:43.943238+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 9 internal anchors

  1. [1]

    ISBN 0201485605

    ACM Press/Addison-Wesley Publishing Co. ISBN 0201485605. doi: 10.1145/311535.311556. URLhttps://doi.org/10.1145/311535.311556. Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision, pp....

  2. [2]

    Unsupervised Learning of Robust Spectral Shape Matching

    Dongliang Cao, Paul Roetzer, and Florian Bernard. Unsupervised learning of robust spectral shape matching.arXiv preprint arXiv:2304.14419,

  3. [3]

    Easi3r: Estimating disen- tangled motion from DUSt3R without training.arXiv preprint arXiv:2503.24391,

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disen- tangled motion from dust3r without training.arXiv preprint arXiv:2503.24391, 2025a. Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3r: Long sequence streaming 3d reconstruction.arXiv preprint arXiv:2507.18255, 2025b. Seokju Cho, Jiahui Hu...

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

  5. [5]

    Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

    Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation.arXiv preprint arXiv:1804.03619,

  6. [6]

    St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152,

    Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152,

  7. [7]

    Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction.arXiv preprint arXiv:2505.00615,

    Simon Giebenhain, Tobias Kirschstein, Martin R ¨unz, Lourdes Agapito, and Matthias Nießner. Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction.arXiv preprint arXiv:2505.00615,

  8. [8]

    Neural head avatars from monocular rgb videos.arXiv preprint arXiv:2112.01554,

    11 Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. Neural head avatars from monocular rgb videos.arXiv preprint arXiv:2112.01554,

  9. [9]

    Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik

    Accessed: 2025-09-25. Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 7122–7131,

  10. [10]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.arXiv preprint arXiv:2410.11831, 2024a

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.arXiv preprint arXiv:2410.11831, 2024a. Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to tr...

  11. [11]

    Learning to track any points from human motion.arXiv preprint arXiv:2507.06233,

    In`es Hyeonsu Kim, Seokju Cho, Jahyeok Koo, Junghyun Park, Jiahui Huang, Joon-Young Lee, and Seungryong Kim. Learning to track any points from human motion.arXiv preprint arXiv:2507.06233,

  12. [12]

    Laplace-beltrami eigenfunctions towards an algorithm that” understands” geometry

    Bruno L´evy. Laplace-beltrami eigenfunctions towards an algorithm that” understands” geometry. In IEEE International Conference on Shape Modeling and Applications 2006 (SMI’06), pp. 13–13. IEEE,

  13. [13]

    Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017a. URLhttps://doi.org/10.1145/3130800.3130813. Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial sh...

  14. [14]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

  15. [15]

    MediaPipe: A Framework for Building Perception Pipelines

    Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines.arXiv preprint arXiv:1906.08172, 2019a. Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chu...

  16. [16]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159,

  17. [17]

    Wham: Reconstructing world- grounded humans with accurate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world- grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2070–2080,

  18. [18]

    DINOv3

    URLhttps://arxiv.org/ abs/2508.10104. Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. InCVPR,

  19. [19]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912,

  20. [20]

    Dynamic point maps: A versatile representation for dynamic 3d reconstruction.arXiv preprint arXiv:2503.16318,

    Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction.arXiv preprint arXiv:2503.16318,

  21. [21]

    MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023a. Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqi...

  22. [22]

    Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo.arXiv preprint arXiv:2412.05268,

    Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, and Huazhe Xu. Densematcher: Learning 3d semantic correspondence for category-level manipulation from a single demo.arXiv preprint arXiv:2412.05268,

  23. [23]

    Even without DINOv3 pretraining, our ap- proach achieves better results than the strongest baseline, highlighting the importance of the canon- ical space. Moreover, without the canonical space, the network embeddings cannot be reliably used for point querying or region localization, as the model may map semantically different regions close together in the...