PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training
Pith reviewed 2026-05-18 22:49 UTC · model grok-4.3
The pith
PaCo-FR pre-trains facial representations by aligning patches to pixels and enforcing spatial consistency on unlabeled images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaCo-FR integrates masked image modeling with patch-pixel alignment via three components: a structured masking strategy that aligns with semantically meaningful facial regions to preserve spatial coherence, a novel patch-based codebook that enhances feature discrimination using multiple candidate tokens per patch, and spatial consistency constraints that preserve geometric relationships between facial components. The framework achieves state-of-the-art results on several facial analysis tasks after pre-training on just 2 million unlabeled images and shows particular gains in conditions involving varying poses, occlusions, and lighting.
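To make the first component concrete, here is a minimal sketch of region-aligned masking, assuming each patch already carries a facial-region label. The source does not say how PaCo-FR assigns regions or picks its mask ratio, so `region_aligned_mask`, the `patch_regions` input, and the 0.5 ratio are illustrative assumptions, not the authors' implementation.

```python
import random


def region_aligned_mask(patch_regions, mask_ratio=0.5, seed=0):
    """Mask whole facial regions at a time until the target ratio is reached.

    patch_regions: one region label per patch (hypothetical input, e.g. the
    output of a face parser pooled onto the patch grid).
    Returns a list of booleans, True where the patch is masked.
    """
    rng = random.Random(seed)
    target = round(mask_ratio * len(patch_regions))
    regions = sorted(set(patch_regions))
    rng.shuffle(regions)  # visit regions in a random order

    masked = [False] * len(patch_regions)
    count = 0
    for region in regions:
        if count == target:
            break
        members = [i for i, r in enumerate(patch_regions) if r == region]
        if count + len(members) <= target:
            chosen = members  # the whole region fits under the budget
        else:
            # Only part of this region fits; sample the remainder from it.
            chosen = rng.sample(members, target - count)
        for i in chosen:
            masked[i] = True
        count += len(chosen)
    return masked


# Toy 16-patch grid with four equally sized regions.
regions = ["eye"] * 4 + ["nose"] * 4 + ["mouth"] * 4 + ["skin"] * 4
mask = region_aligned_mask(regions, mask_ratio=0.5)
```

In this toy case exactly two regions end up fully masked, so the encoder would see two intact regions instead of a random scatter of patches.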
What carries the argument
Patch-pixel aligned end-to-end codebook learning, which uses multiple candidate tokens per patch together with spatial consistency constraints to discriminate fine-grained facial features while respecting anatomical geometry.
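The idea of "multiple candidate tokens per patch" can be sketched as a top-k nearest-neighbour lookup in a codebook. This is an assumption made for illustration: the source does not state how candidates are selected, and `top_k_tokens`, the squared Euclidean distance, and the toy codebook below are hypothetical.

```python
def top_k_tokens(patch_feature, codebook, k=3):
    """Return indices of the k codebook entries closest to patch_feature.

    A standard vector-quantization lookup keeps only the single nearest
    entry; keeping k of them gives several candidate tokens per patch.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    order = sorted(range(len(codebook)),
                   key=lambda i: sq_dist(patch_feature, codebook[i]))
    return order[:k]


# Toy 2-D codebook with four entries (values are arbitrary).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
candidates = top_k_tokens([0.9, 0.1], codebook, k=2)
print(candidates)  # [1, 0]: entry 1 is nearest, entry 0 second
```

With `k=1` this reduces to ordinary nearest-entry quantization, which is one way to read the claim that several candidates add discriminative capacity.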
If this is right
- Better handling of real-world variations such as poses, occlusions, and lighting in facial recognition and expression tasks.
- More efficient use of limited labeled data for fine-tuning on downstream facial analysis applications.
- A scalable pre-training route that reduces the need for expensive annotated facial datasets.
- Improved feature quality for virtual reality and other systems relying on robust facial representations.
Where Pith is reading between the lines
- The emphasis on preserving geometric relationships between facial parts could transfer to pre-training models for other structured visual domains such as medical imaging of organs.
- End-to-end codebook learning with multiple tokens per patch might support further reductions in pre-training data size if tested on smaller unlabeled sets.
- Combining this alignment approach with temporal data from video sequences could extend gains to dynamic facial expression analysis.
- The method's data efficiency suggests potential for deployment in resource-limited settings where collecting large labeled facial corpora is impractical.
Load-bearing premise
That aligning masking to semantically meaningful facial regions plus adding spatial consistency constraints will overcome missing fine-grained semantics, ignored facial spatial structure, and inefficient use of limited labeled data.
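One possible form of a spatial consistency constraint, given here only as a hedged sketch, penalizes changes in the pairwise distances between facial-component centroids. The source does not specify the actual loss; `spatial_consistency_penalty` and the centroid inputs are assumptions introduced for illustration.

```python
import math


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]


def spatial_consistency_penalty(pred_centroids, ref_centroids):
    """Mean squared difference between the two sets of pairwise distances.

    Both arguments are lists of (x, y) centroids in the same component
    order (hypothetical inputs, e.g. left eye, right eye, mouth).
    """
    d_pred = pairwise_distances(pred_centroids)
    d_ref = pairwise_distances(ref_centroids)
    return sum((a - b) ** 2 for a, b in zip(d_pred, d_ref)) / len(d_ref)


reference = [(0, 0), (3, 0), (0, 4)]
shifted = [(1, 1), (4, 1), (1, 5)]   # same layout, translated
stretched = [(0, 0), (6, 0), (0, 8)]  # layout scaled by 2

print(spatial_consistency_penalty(shifted, reference))    # 0.0
print(spatial_consistency_penalty(stretched, reference))  # about 16.67
```

Because only relative distances enter the penalty, a translated face costs nothing while a distorted layout is penalized, which matches the stated goal of preserving geometric relationships between components.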
What would settle it
Pre-train PaCo-FR on a 2-million-image unlabeled facial set, then measure accuracy on standard face recognition and expression recognition benchmarks under occlusion and pose variation; if the results fall short of prior supervised or self-supervised baselines, the performance claim does not hold.
Original abstract
Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PaCo-FR, an unsupervised facial representation pre-training framework that combines masked image modeling with patch-pixel alignment. It introduces three components: a structured masking strategy aligned with semantically meaningful facial regions, a patch-based codebook using multiple candidate tokens per patch, and spatial consistency constraints to preserve geometric relationships. The central claim is that these elements overcome missing fine-grained semantics, ignored facial spatial structure, and inefficient labeled-data use, yielding state-of-the-art results on facial analysis tasks when pre-trained on only 2 million unlabeled images, with particular gains under pose, occlusion, and lighting variations.
Significance. If the performance claims are substantiated, the work would offer a practical advance in facial representation learning by demonstrating that spatially aware, codebook-based pre-training can deliver strong results with modest unlabeled data volumes. This could reduce reliance on large annotated datasets for downstream tasks such as recognition and expression analysis while improving robustness in real-world conditions.
major comments (2)
- Abstract: The assertion of state-of-the-art performance across several facial analysis tasks is presented without any quantitative metrics, named baselines (e.g., MAE, BEiT, or prior facial SSL methods), ablation results, or statistical significance tests. This absence is load-bearing because the central claim rests on the three proposed components producing measurable gains; without these controls it is impossible to attribute improvements to the structured masking, multi-token codebook, or spatial consistency losses rather than training schedule or data selection.
- Abstract: No experimental protocol is supplied, including dataset composition for the 2 million unlabeled images, pre-training hyperparameters, evaluation benchmarks, or implementation details for the patch-pixel alignment. This omission prevents verification of whether the method actually isolates the contribution of each component or overcomes the three stated challenges.
minor comments (1)
- Abstract: The phrase 'several facial analysis tasks' is vague; specifying the exact tasks (e.g., recognition, expression recognition, landmark detection) would improve clarity and allow readers to anticipate the evaluation scope.
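The first major comment asks for statistical significance tests. As a minimal sketch of what such a check could look like, the paired bootstrap below compares two models on the same test examples. It is a generic procedure, not something the paper is known to use; `paired_bootstrap` and the per-example correctness lists are hypothetical.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which model A fails to beat model B.

    correct_a, correct_b: per-example 0/1 correctness on the same test set.
    A small return value suggests A's advantage is unlikely to be
    resampling noise; it is often read as a rough one-sided p-value.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]

    not_better = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


always_right = [1] * 50
always_wrong = [0] * 50
print(paired_bootstrap(always_right, always_wrong))  # 0.0
print(paired_bootstrap(always_right, always_right))  # 1.0
```

Resampling pairs rather than each model's scores independently keeps example difficulty matched between the two models, which is the reason for using a paired test here.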
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the abstract to improve clarity and substantiation of our claims while preserving its concise nature.
- Referee: Abstract: The assertion of state-of-the-art performance across several facial analysis tasks is presented without any quantitative metrics, named baselines (e.g., MAE, BEiT, or prior facial SSL methods), ablation results, or statistical significance tests. This absence is load-bearing because the central claim rests on the three proposed components producing measurable gains; without these controls it is impossible to attribute improvements to the structured masking, multi-token codebook, or spatial consistency losses rather than training schedule or data selection.
  Authors: We agree that the abstract, constrained by length, does not include specific metrics or named baselines. The full manuscript contains extensive quantitative results, direct comparisons against MAE, BEiT, and prior facial SSL methods, component-wise ablations, and statistical significance testing in the Experiments section. To address the concern directly, we have revised the abstract to incorporate key performance highlights, name the primary baselines, and briefly note that gains are supported by ablations on the three proposed components.
  Revision: yes
- Referee: Abstract: No experimental protocol is supplied, including dataset composition for the 2 million unlabeled images, pre-training hyperparameters, evaluation benchmarks, or implementation details for the patch-pixel alignment. This omission prevents verification of whether the method actually isolates the contribution of each component or overcomes the three stated challenges.
  Authors: The abstract is a high-level summary; the full manuscript details the experimental protocol in Sections 3 and 4, including the composition of the 2 million unlabeled images (drawn from public sources such as FFHQ with additional curation for diversity), pre-training hyperparameters, evaluation benchmarks (facial recognition, expression analysis, and robustness under pose/occlusion/lighting), and patch-pixel alignment implementation. We have revised the abstract to include a concise reference to the data scale and primary benchmarks, thereby better linking the method to the stated challenges.
  Revision: yes
Circularity Check
No circularity detected in derivation or claims
The paper introduces PaCo-FR as a new unsupervised framework combining masked image modeling with patch-pixel alignment, using three components: structured masking aligned to facial regions, a multi-token patch codebook, and spatial consistency constraints. The abstract and description contain no equations, mathematical derivations, fitted parameters presented as predictions, or self-citations that reduce any result to the method's own inputs by construction. Performance claims (SOTA on facial tasks with 2M unlabeled images) are asserted as experimental outcomes rather than derived quantities. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the provided text. The central claims rest on the proposed architecture and training strategy, which remain independent of any circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Facial images possess inherent spatial structure and semantically meaningful regions that can be used to guide masking.
invented entities (1)
- Patch-based codebook with multiple candidate tokens (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens... Belief Predictor... spatial consistency constraints"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.