pith. machine review for the scientific record. sign in

arxiv: 2605.07766 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords head similaritywhole-head appearanceface recognitionappearance variationidentity consistencyhierarchical supervisionvideo benchmarkweakly-supervised
0
0 comments X

The pith

Head Similarity extends face recognition to model structured whole-head appearance variations including hairstyle and styling changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard face recognition models, by forcing all images of a person into one invariant representation, lose information about changeable appearance features like hair or accessories. This limits their usefulness in scenarios where identity must be consistent despite such changes or when faces are not visible. The authors propose Head Similarity as a way to explicitly model these variations through hierarchical ordering of similarities at both identity and appearance levels. They support this with a new benchmark built from video data using weak supervision on appearance states and a training framework that combines identity and appearance objectives.

Core claim

Head Similarity is a formulation that extends identity-centric recognition to structured whole-head similarity modeling by capturing intra-identity appearance variation and enforcing hierarchical similarity ordering across identity and appearance states, demonstrated feasible via a framework using hierarchical supervision and identity-aware distillation on a video-derived benchmark.

What carries the argument

The Head Similarity formulation, which explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states.

If this is right

  • Meaningful similarity comparisons remain possible even under occlusion or rear-view conditions where facial cues are absent.
  • Conventional face recognition models are shown to fail at capturing appearance-dependent similarity.
  • Applications requiring identity consistency beyond strict biometric recognition can use whole-head cues.
  • A large-scale benchmark from long-form videos enables training for diverse poses and temporal changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such models could improve person re-identification in videos with frequent appearance changes.
  • Embedding spaces might need to represent multiple appearance states per identity rather than single points.
  • Future work could test generalization to real-world surveillance footage without video-based weak labels.

Load-bearing premise

A large-scale benchmark from long-form videos with weakly-supervised appearance states sufficiently captures diverse poses, occlusions, and temporal changes to train effective models.

What would settle it

A standard face recognition model trained on the same benchmark achieves comparable accuracy on tasks measuring appearance-dependent similarity and hierarchical ordering as the proposed Head Similarity framework.

Figures

Figures reproduced from arXiv: 2605.07766 by Shengcai Liao, Yingfeng Wang, Yuxuan Xiao.

Figure 1
Figure 1. Figure 1: Failure cases of AdaFace on whole-head similarity. The goal is not to verify legal identity, but to pre￾serve the perception that the sequence depicts the same person. Conventional face recognition is designed for a different objective. Modern sys￾tems learn identity-invariant embeddings with margin-based metric learning losses Deng et al. [2019], Kim et al. [2022], deliberately suppress￾ing intra-identity… view at source ↗
Figure 2
Figure 2. Figure 2: Conceptual comparison between identity-centric face recognition and our proposed Head [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the hierarchical similarity structure [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall training framework for Head Similarity. A dual-CLS Vision Transformer backbone [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pipeline of the Head Similarity dataset construction. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ROC curves under aligned and whole￾head inputs. We analyze the effect of adapting face-recognition models from aligned-face inputs to unaligned whole-head images. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: ROC curves comparison on the HeadSim-Head dataset. As reported in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Top-3 retrieval results on the HeadSim-Head test set for AdaFace and our method under [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: ROC curves on HeadSim-Head for dif￾ferent configurations. Dual-CLS consistently out￾performs other variants. To analyze the conflict between identity invari￾ance and appearance-sensitive similarity, we evaluate different architectural variants and loss assignments, as summarized in [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
read the original abstract

Many vision applications require identity consistency beyond strict biometric recognition, especially under non-frontal views or when facial cues are missing. However, conventional face recognition models enforce intra-identity invariance, collapsing appearance variations such as hairstyle or styling changes into a single representation, limiting their use in appearance-sensitive scenarios. To address this limitation, we introduce Head Similarity, a new formulation that extends identity-centric recognition to structured whole-head similarity modeling. Our approach explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states, enabling meaningful comparison even under occlusion or rear-view conditions. We construct a large-scale benchmark from long-form videos with weakly-supervised appearance states, covering diverse poses, occlusions, and temporal changes. As a first step, we develop a simple yet effective framework that jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation. Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while our approach demonstrates the feasibility of structured whole-head similarity modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Head Similarity as a new formulation extending identity-centric face recognition to structured whole-head similarity modeling that explicitly captures intra-identity appearance variations (e.g., hairstyle, styling, occlusion). It constructs a large-scale benchmark from long-form videos using weakly-supervised appearance state labels and proposes a simple framework combining hierarchical supervision with identity-aware distillation. Experiments are presented to show that conventional face recognition collapses appearance variation while the proposed approach demonstrates feasibility of appearance-dependent similarity under diverse poses and views.

Significance. If the central claims hold after addressing validation gaps, the work could meaningfully advance computer vision applications needing nuanced identity consistency beyond biometrics, such as video-based re-identification or non-frontal analysis. The benchmark and hierarchical supervision idea provide a concrete starting point for future research on appearance-sensitive modeling. Credit is due for framing the problem clearly and releasing a new data resource, though the significance is tempered by the absence of label-quality diagnostics that would allow readers to trust the reported gaps versus baselines.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The weakly-supervised appearance state labels extracted from long-form videos are load-bearing for the hierarchical similarity ordering and all downstream claims, yet the section provides no quantitative validation (e.g., label accuracy vs. manual annotation, inter-state consistency under pose variation, or noise-robustness checks). Without such evidence, it remains possible that performance differences versus face-recognition baselines arise from label artifacts rather than the modeling approach.
  2. [§5] §5 (Experiments): The claim that conventional face recognition models fail to capture appearance-dependent similarity while the proposed method succeeds is central, but the reported results lack concrete metrics, error bars, statistical significance tests, or ablation isolating the contribution of hierarchical supervision versus identity-aware distillation. This makes it difficult to evaluate whether the feasibility demonstration is robust.
minor comments (2)
  1. [Abstract] Abstract: The high-level description of the benchmark and framework is clear, but adding one sentence on dataset scale (number of identities, videos, and appearance states) would help readers gauge its coverage of pose/occlusion diversity.
  2. [Method] Notation: The distinction between identity discrimination loss and appearance-sensitive similarity loss could be clarified with a short equation or diagram in the method section to avoid ambiguity for readers unfamiliar with distillation setups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The weakly-supervised appearance state labels extracted from long-form videos are load-bearing for the hierarchical similarity ordering and all downstream claims, yet the section provides no quantitative validation (e.g., label accuracy vs. manual annotation, inter-state consistency under pose variation, or noise-robustness checks). Without such evidence, it remains possible that performance differences versus face-recognition baselines arise from label artifacts rather than the modeling approach.

    Authors: We acknowledge that explicit validation of the weakly-supervised labels is essential for establishing trust in the benchmark. The labels are generated via a temporal consistency and clustering pipeline applied to long-form video tracks, but the current manuscript does not include quantitative diagnostics. In the revised version, we will add a new subsection under §3 that reports: (i) agreement metrics (accuracy, Cohen’s kappa) on a manually annotated subset of 1,000 randomly sampled tracks stratified by pose and occlusion; (ii) inter-state consistency analysis by computing intra- and inter-state similarity distributions under frontal vs. non-frontal views; and (iii) a noise-robustness check by injecting controlled label flips and re-running key experiments. These additions will allow readers to assess whether performance gaps reflect modeling improvements rather than label artifacts. revision: yes

  2. Referee: [§5] §5 (Experiments): The claim that conventional face recognition models fail to capture appearance-dependent similarity while the proposed method succeeds is central, but the reported results lack concrete metrics, error bars, statistical significance tests, or ablation isolating the contribution of hierarchical supervision versus identity-aware distillation. This makes it difficult to evaluate whether the feasibility demonstration is robust.

    Authors: We agree that the experimental presentation requires greater rigor to support the central claims. In the revision we will: (1) report all similarity metrics with error bars computed over five independent training runs using different random seeds; (2) include paired statistical significance tests (e.g., t-tests with p-values) comparing our method against each baseline; (3) add a dedicated ablation table that isolates hierarchical supervision (by removing the appearance-state ordering loss) and identity-aware distillation (by removing the distillation term) while keeping all other components fixed; and (4) expand the metric suite to include mean average precision and rank-1 accuracy in addition to the current similarity scores. These changes will make the feasibility demonstration more robust and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: new formulation and framework with no derivations or self-referential reductions

full rationale

The paper introduces Head Similarity as a new formulation extending face recognition to structured whole-head modeling, constructs a benchmark from long-form videos using weakly-supervised appearance states, and proposes a framework with hierarchical supervision plus identity-aware distillation. No equations, parameter fittings, predictions, or derivations are present in the abstract or described approach. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central claim is a feasibility demonstration via experiments comparing to conventional models, which remains independent of any input reduction or self-definition. This qualifies as a self-contained new-task proposal with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no mathematical details, so no free parameters, axioms, or invented entities can be identified; the work relies on standard deep learning practices and a new benchmark construction approach at a conceptual level.

pith-pipeline@v0.9.0 · 5468 in / 1194 out tokens · 132356 ms · 2026-05-11T02:51:08.051073+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 2 internal anchors

  1. [1]

    Partial fc: Training 10 million identities on a single machine

    Xiang An, Xuhan Zhu, Yuan Gao, Yang Xiao, Yongle Zhao, Ziyong Feng, Lan Wu, Bin Qin, Ming Zhang, Debing Zhang, et al. Partial fc: Training 10 million identities on a single machine. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1445--1449, 2021

  2. [2]

    Vggface2: A dataset for recognising faces across pose and age

    Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67--74. IEEE, 2018

  3. [3]

    Hairnerf: Geometry-aware image synthesis for hairstyle transfer

    Seunggyu Chang, Gihoon Kim, and Hayeon Kim. Hairnerf: Geometry-aware image synthesis for hairstyle transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2448--2458, 2023

  4. [5]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690--4699, 2019

  5. [6]

    Retinaface: Single-shot multi-level face localisation in the wild

    Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203--5212, 2020

  6. [8]

    Hyperbolic metric learning for visual outlier detection

    Alvaro Gonzalez-Jimenez, Simone Lionetti, Dena Bazazian, Philippe Gottfrois, Fabian Gr \"o ger, Alexander Navarini, and Marc Pouly. Hyperbolic metric learning for visual outlier detection. In European Conference on Computer Vision, pages 327--344. Springer, 2024

  7. [9]

    Clothes-changing person re-identification with rgb modality only

    Xinqian Gu, Hong Chang, Bingpeng Ma, Shutao Bai, Shiguang Shan, and Xilin Chen. Clothes-changing person re-identification with rgb modality only. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1060--1069, 2022

  8. [10]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), volume 2, pages 1735--1742. IEEE, 2006

  9. [11]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016

  10. [12]

    Transreid: Transformer-based object re-identification

    Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15013--15022, 2021

  11. [13]

    Head360: Learning a parametric 3d full-head for free-view synthesis in 360 ^

    Yuxiao He, Yiyu Zhuang, Yanwen Wang, Yao Yao, Siyu Zhu, Xiaoyu Li, Qi Zhang, Xun Cao, and Hao Zhu. Head360: Learning a parametric 3d full-head for free-view synthesis in 360 ^ . In European Conference on Computer Vision, pages 254--272. Springer, 2024

  12. [14]

    Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis

    Rui Huang, Shu Zhang, Tianyu Li, and Ran He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE international conference on computer vision, pages 2439--2448, 2017

  13. [15]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33: 0 18661--18673, 2020

  14. [16]

    Adaface: Quality adaptive margin for face recognition

    Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750--18759, 2022

  15. [17]

    Hier: Metric learning beyond class labels via hierarchical regularization

    Sungyeon Kim, Boseung Jeong, and Suha Kwak. Hier: Metric learning beyond class labels via hierarchical regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19903--19912, 2023

  16. [18]

    Partial face recognition: Alignment-free approach

    Shengcai Liao, Anil K Jain, and Stan Z Li. Partial face recognition: Alignment-free approach. IEEE Transactions on pattern analysis and machine intelligence, 35 0 (5): 0 1193--1205, 2012

  17. [19]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740--755. Springer, 2014

  18. [20]

    No fuss distance metric learning using proxies

    Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE international conference on computer vision, pages 360--368, 2017

  19. [21]

    Long-term cloth-changing person re-identification

    Xuelin Qian, Wenxuan Wang, Li Zhang, Fangrui Zhu, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xiangyang Xue. Long-term cloth-changing person re-identification. In Proceedings of the Asian conference on computer vision, 2020

  20. [23]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815--823, 2015

  21. [24]

    First order motion model for image animation

    Aliaksandr Siarohin, St \'e phane Lathuili \`e re, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019

  22. [25]

    Everybody’s talkin’: Let me talk as you want

    Linsen Song, Wayne Wu, Chen Qian, Ran He, and Chen Change Loy. Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security, 17: 0 585--598, 2022

  23. [26]

    Learning part-based convolutional features for person re-identification

    Yifan Sun, Liang Zheng, Yali Li, Yi Yang, Qi Tian, and Shengjin Wang. Learning part-based convolutional features for person re-identification. IEEE transactions on pattern analysis and machine intelligence, 43 0 (3): 0 902--917, 2019

  24. [27]

    Disentangled representation learning gan for pose-invariant face recognition

    Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1415--1424, 2017

  25. [28]

    Video abstraction: A systematic review and classification

    Ba Tu Truong and Svetha Venkatesh. Video abstraction: A systematic review and classification. ACM transactions on multimedia computing, communications, and applications (TOMM), 3 0 (1): 0 3--es, 2007

  26. [29]

    Occlusion robust face recognition based on mask learning

    Weitao Wan and Jiansheng Chen. Occlusion robust face recognition based on mask learning. In 2017 IEEE international conference on image processing (ICIP), pages 3795--3799. IEEE, 2017

  27. [30]

    Learning discriminative features with multiple granularities for person re-identification

    Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM international conference on Multimedia, pages 274--282, 2018 a

  28. [31]

    Cosface: Large margin cosine loss for deep face recognition

    Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265--5274, 2018 b

  29. [34]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Arcface: Additive angular margin loss for deep face recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  30. [35]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Cosface: Large margin cosine loss for deep face recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  31. [36]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Adaface: Quality adaptive margin for face recognition , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  32. [37]

    Qwen3-Omni Technical Report

    Qwen3-omni technical report , author=. arXiv preprint arXiv:2509.17765 , year=

  33. [38]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  34. [39]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    An image is worth 16x16 words: Transformers for image recognition at scale , author=. arXiv preprint arXiv:2010.11929 , year=

  35. [40]

    IEEE Transactions on pattern analysis and machine intelligence , volume=

    Partial face recognition: Alignment-free approach , author=. IEEE Transactions on pattern analysis and machine intelligence , volume=. 2012 , publisher=

  36. [41]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Partial fc: Training 10 million identities on a single machine , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  37. [42]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Learning part-based convolutional features for person re-identification , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2019 , publisher=

  38. [43]

    Proceedings of the 26th ACM international conference on Multimedia , pages=

    Learning discriminative features with multiple granularities for person re-identification , author=. Proceedings of the 26th ACM international conference on Multimedia , pages=

  39. [44]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Transreid: Transformer-based object re-identification , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  40. [45]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Disentangled representation learning gan for pose-invariant face recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  41. [46]

    Proceedings of the IEEE international conference on computer vision , pages=

    Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis , author=. Proceedings of the IEEE international conference on computer vision , pages=

  42. [47]

    2017 IEEE international conference on image processing (ICIP) , pages=

    Occlusion robust face recognition based on mask learning , author=. 2017 IEEE international conference on image processing (ICIP) , pages=. 2017 , organization=

  43. [48]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Facenet: A unified embedding for face recognition and clustering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  44. [49]

    2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06) , volume=

    Dimensionality reduction by learning an invariant mapping , author=. 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06) , volume=. 2006 , organization=

  45. [50]

    Proceedings of the IEEE international conference on computer vision , pages=

    No fuss distance metric learning using proxies , author=. Proceedings of the IEEE international conference on computer vision , pages=

  46. [51]

    Advances in neural information processing systems , volume=

    Supervised contrastive learning , author=. Advances in neural information processing systems , volume=

  47. [52]

    V oxceleb2: Deep speaker recognition,

    Voxceleb2: Deep speaker recognition , author=. arXiv preprint arXiv:1806.05622 , year=

  48. [53]

    2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018) , pages=

    Vggface2: A dataset for recognising faces across pose and age , author=. 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018) , pages=. 2018 , organization=

  49. [54]

    Y., Xu, Y

    Treelora: Efficient continual learning via layer-wise loras guided by a hierarchical gradient-similarity tree , author=. arXiv preprint arXiv:2506.10355 , year=

  50. [55]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Open-ended hierarchical streaming video understanding with vision language models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  51. [56]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Hier: Metric learning beyond class labels via hierarchical regularization , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  52. [57]

    European Conference on Computer Vision , pages=

    Hyperbolic metric learning for visual outlier detection , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  53. [58]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Hairnerf: Geometry-aware image synthesis for hairstyle transfer , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  54. [59]

    European Conference on Computer Vision , pages=

    Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360 ^ , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  55. [60]

    arXiv preprint arXiv:2106.11297 , year=

    Tokenlearner: What can 8 learned tokens do for images and videos? , author=. arXiv preprint arXiv:2106.11297 , year=

  56. [61]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Videomae v2: Scaling video masked autoencoders with dual masking , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  57. [62]

    Video-to-video synthesis

    Video-to-video synthesis , author=. arXiv preprint arXiv:1808.06601 , year=

  58. [63]

    IEEE Transactions on Information Forensics and Security , volume=

    Everybody’s talkin’: Let me talk as you want , author=. IEEE Transactions on Information Forensics and Security , volume=. 2022 , publisher=

  59. [64]

    Advances in neural information processing systems , volume=

    First order motion model for image animation , author=. Advances in neural information processing systems , volume=

  60. [65]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Clothes-changing person re-identification with rgb modality only , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  61. [66]

    Proceedings of the Asian conference on computer vision , year=

    Long-term cloth-changing person re-identification , author=. Proceedings of the Asian conference on computer vision , year=

  62. [67]

    ACM transactions on multimedia computing, communications, and applications (TOMM) , volume=

    Video abstraction: A systematic review and classification , author=. ACM transactions on multimedia computing, communications, and applications (TOMM) , volume=. 2007 , publisher=

  63. [68]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Retinaface: Single-shot multi-level face localisation in the wild , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  64. [69]

    European conference on computer vision , pages=

    Microsoft coco: Common objects in context , author=. European conference on computer vision , pages=. 2014 , organization=