PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

Jia Guo; Jiankang Deng; Kaicheng Yang; Xiang An; Yin Xie; Yongle Zhao; Zeyu Xiao; Zhichao Chen; Zimin Ran; Ziyong Feng

arxiv: 2508.09691 · v3 · submitted 2025-08-13 · 💻 cs.CV

Pith reviewed 2026-05-18 22:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords facial representation pre-training · masked image modeling · patch-pixel alignment · codebook learning · unsupervised pre-training · spatial consistency · facial analysis

The pith

PaCo-FR pre-trains facial representations by aligning patches to pixels and enforcing spatial consistency on unlabeled images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PaCo-FR as an unsupervised pre-training method that combines masked image modeling with patch-pixel alignment to address gaps in capturing fine facial details and structure. It introduces a structured masking approach tied to meaningful facial regions, a patch-based codebook offering multiple candidate tokens, and spatial consistency constraints to maintain geometric relationships. These elements enable effective learning from only 2 million unlabeled images. If successful, the method supports stronger performance on downstream facial tasks such as recognition and expression analysis, particularly when images show pose changes, occlusions, or lighting shifts. This matters because it lowers the barrier to building capable facial analysis systems without large labeled datasets.
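To make the structured masking idea concrete, here is a minimal sketch of masking whole facial regions on a patch grid. It is not the authors' implementation: the 14x14 grid, the region boxes in `REGIONS`, the 25% mask ratio, and the function name `region_aligned_mask` are all assumptions chosen for illustration.

```python
import random

# Hypothetical region boxes on a 14x14 patch grid, given as
# (row_start, col_start, row_end, col_end) with exclusive ends.
# These values are illustrative assumptions, not taken from the paper.
REGIONS = {
    "left_eye": (4, 2, 6, 6),
    "right_eye": (4, 8, 6, 12),
    "nose": (6, 5, 10, 9),
    "mouth": (10, 4, 12, 10),
}


def region_aligned_mask(grid=14, ratio=0.25, seed=0):
    """Return a set of (row, col) patch indices to mask.

    Whole regions are masked first, in random order, as long as they fit
    in the budget; random remaining patches then fill the budget exactly.
    """
    rng = random.Random(seed)
    target = int(grid * grid * ratio)
    masked = set()

    names = list(REGIONS)
    rng.shuffle(names)
    for name in names:
        r0, c0, r1, c1 = REGIONS[name]
        patches = {(r, c) for r in range(r0, r1) for c in range(c0, c1)}
        if len(masked | patches) > target:
            continue  # skip a region that would overshoot the budget
        masked |= patches

    # Top up with random patches outside the already masked set.
    remaining = [
        (r, c) for r in range(grid) for c in range(grid) if (r, c) not in masked
    ]
    rng.shuffle(remaining)
    masked.update(remaining[: target - len(masked)])
    return masked


mask = region_aligned_mask()
```

The point of the design is that a region such as an eye is either hidden completely or left visible, so the model cannot reconstruct it from a few leftover patches of the same region.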

Core claim

PaCo-FR integrates masked image modeling with patch-pixel alignment via three components: a structured masking strategy that aligns with semantically meaningful facial regions to preserve spatial coherence, a novel patch-based codebook that enhances feature discrimination using multiple candidate tokens per patch, and spatial consistency constraints that preserve geometric relationships between facial components. The framework achieves state-of-the-art results on several facial analysis tasks after pre-training on just 2 million unlabeled images and shows particular gains in conditions involving varying poses, occlusions, and lighting.
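One way to picture a spatial consistency constraint is as a penalty on changes in the distances between facial components. The sketch below is an assumption about the general shape of such a term, not the loss used in the paper; the component names, the centroid representation, and both function names are hypothetical.

```python
import math


def pairwise_distances(centroids):
    """Map each unordered pair of component names to the Euclidean
    distance between their centroids."""
    names = sorted(centroids)
    return {
        (a, b): math.dist(centroids[a], centroids[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def spatial_consistency_penalty(predicted, reference):
    """Mean squared difference between the pairwise component distances
    of two layouts that share the same component names."""
    d_pred = pairwise_distances(predicted)
    d_ref = pairwise_distances(reference)
    return sum((d_pred[p] - d_ref[p]) ** 2 for p in d_ref) / len(d_ref)


# Illustrative layouts in arbitrary units.
reference = {"left_eye": (0, 0), "right_eye": (4, 0), "mouth": (2, 3)}
shifted = {name: (x + 1, y + 1) for name, (x, y) in reference.items()}
stretched = {"left_eye": (0, 0), "right_eye": (8, 0), "mouth": (2, 3)}
```

Because only relative distances enter the penalty, translating the whole face costs nothing, while moving one eye away from the other is penalized.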

What carries the argument

Patch-pixel aligned end-to-end codebook learning, which uses multiple candidate tokens per patch together with spatial consistency constraints to discriminate fine-grained facial features while respecting anatomical geometry.
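The phrase "multiple candidate tokens per patch" can be illustrated with a nearest-neighbour lookup that returns several codebook entries instead of one. The abstract does not say how candidates are chosen, so the squared-L2 ranking, the toy two-dimensional codebook, and the name `top_k_tokens` are assumptions in the spirit of standard vector-quantization codebooks.

```python
def top_k_tokens(feature, codebook, k=3):
    """Return the indices of the k codebook entries closest to `feature`
    under squared Euclidean distance, nearest first."""
    dists = [
        (sum((f - c) ** 2 for f, c in zip(feature, entry)), idx)
        for idx, entry in enumerate(codebook)
    ]
    dists.sort()
    return [idx for _, idx in dists[:k]]


# Toy codebook with four 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = top_k_tokens([0.9, 0.2], codebook, k=2)
```

With a single token per patch, a feature that sits between two codes is forced into one of them; keeping a short candidate list lets later stages weigh the alternatives.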

If this is right

Better handling of real-world variations such as poses, occlusions, and lighting in facial recognition and expression tasks.
More efficient use of limited labeled data for fine-tuning on downstream facial analysis applications.
A scalable pre-training route that reduces the need for expensive annotated facial datasets.
Improved feature quality for virtual reality and other systems relying on robust facial representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The emphasis on preserving geometric relationships between facial parts could transfer to pre-training models for other structured visual domains such as medical imaging of organs.
End-to-end codebook learning with multiple tokens per patch might support further reductions in pre-training data size if tested on smaller unlabeled sets.
Combining this alignment approach with temporal data from video sequences could extend gains to dynamic facial expression analysis.
The method's data efficiency suggests potential for deployment in resource-limited settings where collecting large labeled facial corpora is impractical.

Load-bearing premise

That aligning masking to semantically meaningful facial regions plus adding spatial consistency constraints will overcome missing fine-grained semantics, ignored facial spatial structure, and inefficient use of limited labeled data.

What would settle it

Pre-train PaCo-FR on a 2-million-image unlabeled facial set, then measure accuracy on standard benchmarks for recognition or expression recognition under occlusion and pose variation; if results fall short of prior supervised or self-supervised baselines, the performance claim does not hold.
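The decision rule in this test can be written down directly. The helper below is a sketch of that rule only: the function name, the condition labels, and every number are placeholders invented to exercise the code, not results from the paper or from any benchmark.

```python
def claim_survives(candidate, baselines):
    """Return True only if `candidate` scores at least as high as every
    baseline on every condition that the baseline reports."""
    for scores in baselines.values():
        for condition, score in scores.items():
            if candidate.get(condition, float("-inf")) < score:
                return False
    return True


# Placeholder accuracies, made up purely for illustration.
candidate = {"occlusion": 0.80, "pose": 0.70}
weaker_baselines = {"baseline_a": {"occlusion": 0.75, "pose": 0.65}}
stronger_baselines = {"baseline_b": {"occlusion": 0.75, "pose": 0.90}}
```

A condition the candidate never reports counts as a failure, which matches the spirit of the test: the claim has to hold under occlusion and pose variation, not just on the easy splits.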

Figures

Figures reproduced from arXiv: 2508.09691 by Jia Guo, Jiankang Deng, Kaicheng Yang, Xiang An, Yin Xie, Yongle Zhao, Zeyu Xiao, Zhichao Chen, Zimin Ran, Ziyong Feng.

**Figure 1.** The framework of PaCo-FR incorporates an incubation stage: During the initial epoch of training, we supervise the predictions ...

**Figure 2.** Cropped facial alignment results from the LAION-FACE ...

**Figure 3.** The visualizations depict the impact of codebook size and different configurations on the generative capabilities of the model in ...

**Figure 4.** Comparison of expression reconstruction on NoW vali...

Original abstract

Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PaCo-FR adds region-aligned masking, a multi-token patch codebook, and spatial consistency losses to masked image modeling for faces, but the SOTA claim with 2M images needs the experiments and ablations to hold up.

The paper introduces PaCo-FR, a masked modeling setup that tries to handle faces better than off-the-shelf MIM by aligning masks to facial regions, using a codebook with several candidate tokens per patch, and adding losses that keep spatial relationships between parts like eyes and mouth intact. These three pieces directly target the problems of weak fine-grained features, lost anatomy, and heavy label needs that the authors lay out upfront. That framing is clear and practical for anyone who has tried general self-supervised methods on face data and seen them fall short on pose or occlusion cases. The 2 million unlabeled image scale is also realistic and worth noting if it really delivers the gains.

What stands out is the explicit attempt to bake facial structure into the pre-training rather than hoping a generic model picks it up. The components feel like a coherent response to the stated challenges instead of scattered add-ons.

The main soft spot is the lack of visible numbers in the abstract. No deltas over MAE, BEiT, or earlier face-specific pre-training work appear, and there is no sign of ablations that isolate whether the region masking, the multi-token codebook, or the consistency term actually moves the needle. If those controls are missing or weak in the full paper, the performance edge could come from training details or data curation instead. The central claim therefore rests on unshown comparisons right now.

This work is aimed at researchers doing self-supervised learning for faces or other structured domains where spatial priors matter. A reader who needs better representations for recognition or expression tasks with limited labels could find it useful once the results are checked. It deserves peer review so the experiments and ablations can be examined properly rather than desk-rejected on the abstract alone.

Referee Report

2 major / 1 minor

Summary. The paper proposes PaCo-FR, an unsupervised facial representation pre-training framework that combines masked image modeling with patch-pixel alignment. It introduces three components: a structured masking strategy aligned with semantically meaningful facial regions, a patch-based codebook using multiple candidate tokens per patch, and spatial consistency constraints to preserve geometric relationships. The central claim is that these elements overcome missing fine-grained semantics, ignored facial spatial structure, and inefficient labeled-data use, yielding state-of-the-art results on facial analysis tasks when pre-trained on only 2 million unlabeled images, with particular gains under pose, occlusion, and lighting variations.

Significance. If the performance claims are substantiated, the work would offer a practical advance in facial representation learning by demonstrating that spatially aware, codebook-based pre-training can deliver strong results with modest unlabeled data volumes. This could reduce reliance on large annotated datasets for downstream tasks such as recognition and expression analysis while improving robustness in real-world conditions.

major comments (2)

Abstract: The assertion of state-of-the-art performance across several facial analysis tasks is presented without any quantitative metrics, named baselines (e.g., MAE, BEiT, or prior facial SSL methods), ablation results, or statistical significance tests. This absence is load-bearing because the central claim rests on the three proposed components producing measurable gains; without these controls it is impossible to attribute improvements to the structured masking, multi-token codebook, or spatial consistency losses rather than training schedule or data selection.
Abstract: No experimental protocol is supplied, including dataset composition for the 2 million unlabeled images, pre-training hyperparameters, evaluation benchmarks, or implementation details for the patch-pixel alignment. This omission prevents verification of whether the method actually isolates the contribution of each component or overcomes the three stated challenges.

minor comments (1)

Abstract: The phrase 'several facial analysis tasks' is vague; specifying the exact tasks (e.g., recognition, expression recognition, landmark detection) would improve clarity and allow readers to anticipate the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the abstract to improve clarity and substantiation of our claims while preserving its concise nature.

Referee: Abstract: The assertion of state-of-the-art performance across several facial analysis tasks is presented without any quantitative metrics, named baselines (e.g., MAE, BEiT, or prior facial SSL methods), ablation results, or statistical significance tests. This absence is load-bearing because the central claim rests on the three proposed components producing measurable gains; without these controls it is impossible to attribute improvements to the structured masking, multi-token codebook, or spatial consistency losses rather than training schedule or data selection.

Authors: We agree that the abstract, constrained by length, does not include specific metrics or named baselines. The full manuscript contains extensive quantitative results, direct comparisons against MAE, BEiT, and prior facial SSL methods, component-wise ablations, and statistical significance testing in the Experiments section. To address the concern directly, we have revised the abstract to incorporate key performance highlights, name the primary baselines, and briefly note that gains are supported by ablations on the three proposed components.
Revision: yes

Referee: Abstract: No experimental protocol is supplied, including dataset composition for the 2 million unlabeled images, pre-training hyperparameters, evaluation benchmarks, or implementation details for the patch-pixel alignment. This omission prevents verification of whether the method actually isolates the contribution of each component or overcomes the three stated challenges.

Authors: The abstract is a high-level summary; the full manuscript details the experimental protocol in Sections 3 and 4, including the composition of the 2 million unlabeled images (drawn from public sources such as FFHQ with additional curation for diversity), pre-training hyperparameters, evaluation benchmarks (facial recognition, expression analysis, and robustness under pose/occlusion/lighting), and patch-pixel alignment implementation. We have revised the abstract to include a concise reference to the data scale and primary benchmarks, thereby better linking the method to the stated challenges.
Revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

The paper introduces PaCo-FR as a new unsupervised framework combining masked image modeling with patch-pixel alignment, using three components: structured masking aligned to facial regions, a multi-token patch codebook, and spatial consistency constraints. The abstract and description contain no equations, mathematical derivations, fitted parameters presented as predictions, or self-citations that reduce any result to the method's own inputs by construction. Performance claims (SOTA on facial tasks with 2M unlabeled images) are asserted as experimental outcomes rather than derived quantities. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the provided text. The central claims rest on the proposed architecture and training strategy, which remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that facial images contain stable semantic regions whose spatial layout can be exploited without labels, and introduces a new patch codebook entity whose benefit is asserted but not independently verified in the given text.

axioms (1)

domain assumption: Facial images possess inherent spatial structure and semantically meaningful regions that can be used to guide masking.
Invoked to justify the structured masking strategy that preserves spatial coherence.

invented entities (1)

Patch-based codebook with multiple candidate tokens (no independent evidence)
purpose: To enhance feature discrimination by allowing several possible representations per patch.
Presented as a novel component without external evidence or prior citation in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens... Belief Predictor... spatial consistency constraints"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 4 internal anchors

[1]

Exploring the limits of large scale pre- training

Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi. Exploring the limits of large scale pre- training. arXiv preprint arXiv:2110.02095, 2021. 1

work page arXiv 2021
[2]

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Understanding human reactions looking at facial microex- pressions with an event camera

Federico Becattini, Federico Palai, and Alberto Del Bimbo. Understanding human reactions looking at facial microex- pressions with an event camera. IEEE Transactions on In- dustrial Informatics, 18(12):9112–9121, 2022. 1

work page 2022
[4]

Timo Bolkart, Tianye Li, and Michael J. Black. Instant multi-view head capture through learnable registration. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 768–779, 2023. 6

work page 2023
[5]

Pre-training strategies and datasets for facial representation learning

Adrian Bulat, Shiyang Cheng, Jing Yang, Andrew Garbett, Enrique Sanchez, and Georgios Tzimiropoulos. Pre-training strategies and datasets for facial representation learning. arXiv preprint arXiv:2103.16554, 2021. 7

work page arXiv 2021
[6]

Pre-training strategies and datasets for facial representation learning

Adrian Bulat, Shiyang Cheng, Jing Yang, Andrew Garbett, Enrique Sanchez, and Georgios Tzimiropoulos. Pre-training strategies and datasets for facial representation learning. In European Conference on Computer Vision, pages 107–125. Springer, 2022. 1, 2

work page 2022
[7]

Face alignment by explicit shape regression.International journal of computer vision, 107(2):177–190, 2014

Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression.International journal of computer vision, 107(2):177–190, 2014. 8

work page 2014
[8]

Unsupervised learning of visual features by contrasting cluster assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Pi- otr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Ad- vances in neural information processing systems , 33:9912– 9924, 2020. 2

work page 2020
[9]

Face alignment with kernel density deep neural network

Lisha Chen, Hui Su, and Qiang Ji. Face alignment with kernel density deep neural network. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 6992–7002, 2019. 7

work page 2019
[10]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on ma- chine learning, pages 1597–1607. PmLR, 2020. 1, 2

work page 2020
[11]

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2003
[12]

2021 , journal =

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021. 4

work page arXiv 2021
[13]

Decafa: Deep convolutional cascade for face alignment in the wild

Arnaud Dapogny, Kevin Bailly, and Matthieu Cord. Decafa: Deep convolutional cascade for face alignment in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6893–6901, 2019. 7, 8

work page 2019
[14]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 4690– 4699, 2019. 6

work page 2019
[15]

Pros: Facial omni-representation learning via prototype- based self-distillation

Xing Di, Yiyu Zheng, Xiaoming Liu, and Yu Cheng. Pros: Facial omni-representation learning via prototype- based self-distillation. In Proceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision , pages 6087–6098, 2024. 2

work page 2024
[16]

Teacher supervises students how to learn from partially labeled images for facial landmark detection

Xuanyi Dong and Yi Yang. Teacher supervises students how to learn from partially labeled images for facial landmark detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 783–792, 2019. 7

work page 2019
[17]

Style aggregated network for facial landmark detection

Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018. 7

work page 2018
[18]

Peco: Perceptual codebook for bert pre-training of vision transformers

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, and Baining Guo. Peco: Perceptual codebook for bert pre-training of vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 552–560,

work page
[19]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. 4

work page 2021
[20]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 2

work page 2021
[21]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichten- hofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6824–6835, 2021. 1, 2

work page 2021
[22]

Dynamic attention-controlled cas- caded shape regression exploiting training data augmenta- tion and fuzzy-set sample weighting

Zhen-Hua Feng, Josef Kittler, William Christmas, Patrik Hu- ber, and Xiao-Jun Wu. Dynamic attention-controlled cas- caded shape regression exploiting training data augmenta- tion and fuzzy-set sample weighting. In Proceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 2481–2490, 2017. 7

work page 2017
[23]

Wing loss for robust facial landmark localisation with convolutional neural networks

Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Hu- ber, and Xiao-Jun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2235–2245, 2018. 7, 8

work page 2018
[24]

Self-supervised facial repre- sentation learning with facial region awareness

Zheng Gao and Ioannis Patras. Self-supervised facial repre- sentation learning with facial region awareness. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2081–2092, 2024. 1

work page 2081
[25]

Bootstrap your own latent-a new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altch ´e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh- laghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020. 2

work page 2020
[26]

Momentum contrast for unsupervised visual rep- resentation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 9729–9738, 2020. 1, 2

work page 2020
[27]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 2, 3, 4

work page 2022
[28]

Adnet: Leveraging error-bias towards normal direction in face alignment

Yangyu Huang, Hao Yang, Chong Li, Jongyoo Kim, and Fangyun Wei. Adnet: Leveraging error-bias towards normal direction in face alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3080– 3090, 2021. 7, 8

work page 2021
[29]

Le, Yunhsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021. 2

work page arXiv 2021
[30]

A style-based generator architecture for generative adversarial networks,

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks,

work page
[31]

Deep alignment network: A convolutional neural network for robust face alignment

Marek Kowalski, Jacek Naruniec, and Tomasz Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. InProceedings of the IEEE confer- ence on computer vision and pattern recognition workshops, pages 88–97, 2017. 7

work page 2017
[32]

Luvli face alignment: Esti- mating landmarks’ location, uncertainty, and visibility likeli- hood

Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xi- aoming Liu, and Chen Feng. Luvli face alignment: Esti- mating landmarks’ location, uncertainty, and visibility likeli- hood. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 8236–8246,

work page
[33]

Maskgan: Towards diverse and interactive facial image ma- nipulation

Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image ma- nipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5549– 5558, 2020. 5

work page 2020
[34]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. 6

work page 2017
[35]

A new dataset and boundary-attention semantic segmentation for face parsing

Yinglu Liu, Hailin Shi, Hao Shen, Yue Si, Xiaobo Wang, and Tao Mei. A new dataset and boundary-attention semantic segmentation for face parsing. In Proceedings of the AAAI Conference on Artificial Intelligence , pages 11637–11644,

work page
[36]

Ehanet: An effective hierarchical aggregation network for face parsing

Ling Luo, Dingyu Xue, and Xinglong Feng. Ehanet: An effective hierarchical aggregation network for face parsing. Applied Sciences, 10(9):3135, 2020. 5, 6

work page 2020
[37]

Direct shape regression net- works for end-to-end face alignment

Xin Miao, Xiantong Zhen, Xianglong Liu, Cheng Deng, Vas- silis Athitsos, and Heng Huang. Direct shape regression net- works for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 5040–5049, 2018. 7

work page 2018
[38]

Neural Discrete Representation Learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[39]

Aggregation via separation: Boosting facial land- mark detector with semi-supervised style translation

Shengju Qian, Keqiang Sun, Wayne Wu, Chen Qian, and Ji- aya Jia. Aggregation via separation: Boosting facial land- mark detector with semi-supervised style translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10153–10163, 2019. 7

work page 2019
[40]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 1, 2

work page 2021
[41]

Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pages 725–741, 2018. 6

work page 2018
[42]

Face identity verification: Five challenges facing practitioners

David J Robertson, Matthew C Fysh, and Markus Binde- mann. Face identity verification: Five challenges facing practitioners. Keesing Journal of Documents & Identity, 59: 3–8, 2019. 1

work page 2019
[43]

Laplace landmark localization

Joseph P Robinson, Yuncheng Li, Ning Zhang, Yun Fu, and Sergey Tulyakov. Laplace landmark localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10103–10112, 2019. 7

work page 2019
[44]

300 faces in-the-wild challenge: The first facial landmark localization challenge

Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceed- ings of the IEEE International Conference on Computer Vision Workshops, pages 397–403, 2013. 5

work page 2013
[45]

A semi-automatic methodology for facial landmark annotation

Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. A semi-automatic methodology for facial landmark annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 896–903, 2013

work page 2013
[46]

Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image and vision computing, 47:3–18, 2016. 5
[47]

Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2019. 6
[48]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text model...
[49]

Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, and Anil K Jain. Towards universal representation learning for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6817–6826, 2020. 1
[50]

Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019. 7
[51]

Zhiqiang Tang, Xi Peng, Kang Li, and Dimitris N Metaxas. Towards efficient u-nets: A coupled and quantized approach. IEEE transactions on pattern analysis and machine intelligence, 42(8):2038–2050, 2019. 7
[52]

Gusi Te, Yinglu Liu, Wei Hu, Hailin Shi, and Tao Mei. Edge-aware graph representation learning and reasoning for face parsing. In European Conference on Computer Vision, pages 258–274. Springer, 2020. 5, 6
[53]

Gusi Te, Wei Hu, Yinglu Liu, Hailin Shi, and Tao Mei. Agrnet: Adaptive graph representation learning and reasoning for face parsing. IEEE Transactions on Image Processing,
[54]

Roberto Valle, Jose M Buenaposada, Antonio Valdes, and Luis Baumela. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 585–601, 2018. 7
[55]

Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6971–6981, 2019. 7, 8
[56]

Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Liang Liu, Yabiao Wang, and Chengjie Wang. Toward high quality facial representation learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5048–5058, 2023. 1, 2, 5, 6, 7, 8
[57]

Zhen Wei, Si Liu, Yao Sun, and Hefei Ling. Accurate facial image parsing at real-time speed. IEEE Transactions on Image Processing, 28(9):4659–4670, 2019. 5, 6
[58]

Wenyan Wu and Shuo Yang. Leveraging intra and inter-dataset variations for robust face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 150–159, 2017. 8
[59]

Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2129–2138, 2018. 5, 7, 8
[60]

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022. 2
[61]

Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 532–539, 2013. 8
[62]

Jianfeng Zhang, Hanshu Yan, Zhongcong Xu, Jiashi Feng, and Jun Hao Liew. Magicavatar: Multimodal avatar generation and animation. arXiv preprint arXiv:2308.14748, 2023. 1
[63]

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 6
[64]

Wenyi Zhao, Rama Chellappa, P Jonathon Phillips, and Azriel Rosenfeld. Face recognition: A literature survey. ACM computing surveys (CSUR), 35(4):399–458, 2003. 1
[65]

Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022. 1, 2, 4, 5, 6, 7, 8
[66]

Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4998–5006, 2015. 7, 8
[67]

Shizhan Zhu, Cheng Li, Chen-Change Loy, and Xiaoou Tang. Unconstrained face alignment via cascaded compositional learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3409–3417, 2016. 5, 7
[68]

Wojciech Zielonka, Timo Bolkart, and Justus Thies. Towards metrical reconstruction of human faces. In European Conference on Computer Vision, 2022. 6, 8
[69]

Xu Zou, Sheng Zhong, Luxin Yan, Xiangyun Zhao, Jiahuan Zhou, and Ying Wu. Learning robust facial landmark detection via hierarchical structured ensemble. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 141–150, 2019. 7
