UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection
Pith reviewed 2026-05-09 22:35 UTC · model grok-4.3
The pith
UAU-Net improves facial action unit detection by explicitly modeling uncertainty in both feature representation and classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UAU-Net is an uncertainty-aware framework for AU detection. At the representation stage, the CV-AFE module uses a conditional VAE to learn AU feature means and variances across scales, conditioned on labels to capture inter-AU dependencies. At the decision stage, the AB-ENN uses Beta distributions to parameterize uncertainty and an asymmetric loss to handle label imbalance, reducing overconfident predictions.
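The CVAE machinery this claim relies on can be made concrete with a minimal sketch of the standard reparameterization and KL terms (Kingma & Welling). This is an illustration of the generic building blocks, not the authors' CV-AFE code; shapes and function names are assumptions.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over feature dims."""
    return 0.5 * np.sum(np.exp(logvar) + np.square(mu) - 1.0 - logvar, axis=-1)
```

In a CVAE, `mu` and `logvar` would additionally be conditioned on the AU label vector; the learned `logvar` is what gives the representation its per-feature uncertainty.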
What carries the argument
CV-AFE, a conditional VAE-based module that learns probabilistic AU representations by estimating means and variances conditioned on labels, and AB-ENN, an asymmetric Beta evidential neural network that models predictive uncertainty with Beta distributions and applies an asymmetric loss for imbalanced multi-label detection.
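The Beta evidential step AB-ENN builds on can be sketched in a few lines, following the standard subjective-logic convention (as in Sensoy et al., adapted to binary labels): non-negative evidence maps to Beta parameters, the Beta mean gives the prediction, and total evidence mass bounds the uncertainty. The function name and the vacuity formula `2 / (a + b)` are the conventional choices, not code from the paper.

```python
import numpy as np

def beta_evidential_predict(pos_evidence, neg_evidence):
    """Map non-negative per-AU evidence to a Beta(a, b) opinion."""
    a = np.asarray(pos_evidence) + 1.0   # Beta alpha = positive evidence + 1
    b = np.asarray(neg_evidence) + 1.0   # Beta beta  = negative evidence + 1
    prob = a / (a + b)                   # Beta mean: expected activation probability
    vacuity = 2.0 / (a + b)              # high when total evidence is low
    return prob, vacuity
```

With zero evidence the prediction is a maximally uncertain 0.5; confident outputs require accumulating evidence, which is what discourages overconfidence on ambiguous inputs.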
If this is right
- Improved robustness to visual noise and subject variations through variance estimation in features.
- More reliable confidence estimates that avoid overconfidence on ambiguous or imbalanced labels.
- Enhanced capture of inter-AU relationships via label conditioning in the representation module.
- Overall higher F1 scores and better calibration metrics on standard AU detection benchmarks.
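"Calibration metrics" here typically means expected calibration error (ECE); a minimal binary-label version, written independently of the paper (binning scheme and bin count are conventional choices, not the authors'):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions: bin by confidence, average |conf - acc| gaps."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)   # confidence of predicted class
    correct = (probs >= 0.5) == (labels == 1)           # prediction matched label
    bins = np.clip(((conf - 0.5) * 2 * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```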
Where Pith is reading between the lines
- Extending this dual uncertainty modeling to other computer vision tasks involving ambiguous labels could yield similar gains in reliability.
- Integration with temporal modeling might further address dynamic AU sequences in video.
- Testing on datasets with different imbalance ratios would help confirm the asymmetric loss's effectiveness beyond BP4D and DISFA.
Load-bearing premise
The introduced CV-AFE and AB-ENN modules effectively capture the heterogeneous AU-specific uncertainties without introducing artifacts or overfitting to the BP4D and DISFA datasets.
What would settle it
Evaluating UAU-Net on a new AU dataset with substantially different visual conditions, label distributions, or noise levels, and finding no improvement in performance or calibration compared to deterministic baselines would falsify the central claim.
original abstract
Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
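The asymmetric loss the abstract refers to follows the formulation popularized by Ridnik et al. for imbalanced multi-label classification; a per-label sketch with their default hyperparameters (the authors' exact variant is not given in the abstract):

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    """Per-label asymmetric focal loss for a predicted probability p."""
    if y == 1:
        # positives: standard focal term, usually with little or no down-weighting
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, 1e-8))
    # negatives: the probability shift zeroes out very easy negatives entirely,
    # and the larger focusing exponent down-weights the remaining easy ones
    p_shift = max(p - clip, 0.0)
    return -(p_shift ** gamma_neg) * math.log(max(1.0 - p_shift, 1e-8))
```

The asymmetry matters because in AU datasets negatives vastly outnumber positives: without it, the sea of easy negatives dominates the gradient and pushes predictions toward confident inactivity.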
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UAU-Net, an uncertainty-aware framework for facial action unit (AU) detection. It introduces CV-AFE, a conditional VAE-based module that learns probabilistic AU representations by estimating means and variances across spatio-temporal scales while conditioning on AU labels to capture inter-AU dependencies and uncertainties from noise/variations. At the decision stage, AB-ENN parameterizes predictive uncertainty via Beta distributions and employs an asymmetric loss to address overconfidence under severe label imbalance. Experiments on BP4D and DISFA are claimed to show strong detection performance with further analyses indicating that joint uncertainty modeling at both stages improves robustness and reliability.
Significance. If the quantitative claims hold, the work has moderate significance for affective computing and computer vision. Explicitly modeling heterogeneous uncertainties in both representation learning and evidential classification addresses a recognized limitation of deterministic AU detectors. The combination of CVAE conditioning and asymmetric Beta evidential networks is a coherent technical contribution that could improve calibration and robustness in imbalanced, ambiguous settings typical of real-world facial analysis.
major comments (3)
- [Abstract] The central claims of 'strong AU detection performance' and 'improved robustness and reliability' are not supported by any numbers, baselines, ablation tables, or statistical tests in the provided abstract; without these, the magnitude and reliability of the gains cannot be assessed.
- [Section 4] Experiments: the claim that uncertainty modeling improves performance rests on results from BP4D and DISFA, yet there is no mention of multiple random seeds, standard deviations, or significance testing (e.g., paired t-tests against baselines); this undermines the assertion that the gains are reproducible rather than artifacts of initialization or dataset-specific tuning.
- [Section 3.2] AB-ENN: the asymmetric loss is presented as essential for mitigating overconfidence on imbalanced binary labels, but no ablation replacing it with a symmetric evidential loss is reported; without this comparison it is unclear whether the asymmetry is load-bearing for the reliability improvements.
minor comments (2)
- [Section 3.1] The description of CV-AFE conditioning on AU labels should explicitly state whether label information is available at inference time or only during training, as this affects practical deployment.
- [Figure 1] Figure 1 (architecture overview) would benefit from clearer annotation of the variance outputs from CV-AFE and the Beta parameters from AB-ENN to help readers trace the uncertainty flow.
Circularity Check
No circularity in the proposed UAU-Net framework
full rationale
The paper introduces two new modules, CV-AFE (a conditional VAE-based feature extractor) and AB-ENN (an asymmetric Beta evidential neural network), as independent architectural contributions for modeling uncertainty at the representation and decision stages. Neither is defined in terms of the target AU detection performance, nor are fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled in via prior work appear in the abstract or the described contributions. Performance claims rest on experiments on the BP4D and DISFA datasets rather than reducing to the inputs by construction. The derivation chain is self-contained, with external empirical validation.
Reference graph
Works this paper leans on
- [1] R. Rashmi Adyapady and B. Annappa. 2023. A comprehensive review of facial expression recognition techniques. Multimedia Systems 29, 1 (2023), 73–103.
- [2] Jiyuan Cao, Zhilei Liu, and Yong Zhang. 2022. Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling. In IJCNN 2022, Padua, Italy. IEEE, 1–8.
- [3] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. 2020. Data uncertainty learning in face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5710–5719.
- [4] Yanan Chang, Caichao Zhang, Yi Wu, and Shangfei Wang. 2024. Facial Action Unit Recognition Enhanced by Text Descriptions of FACS. IEEE Transactions on Affective Computing (2024).
- [5] Lei Chen, Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022. Uncertainty-Aware Representation Learning for Action Segmentation. In IJCAI, Vol. 2. 6.
- [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
- [7] Paul Ekman and Wallace V. Friesen. 1978. Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978).
- [8] Zongbo Han, Changqing Zhang, Huazhu Fu, and Joey Tianyi Zhou. 2022. Trusted multi-view classification with dynamic evidential fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2 (2022), 2551–2566.
- [9] Wenchong He, Zhe Jiang, Tingsong Xiao, Zelin Xu, and Yukun Li. 2025. A survey on uncertainty quantification methods for deep learning. Comput. Surveys (2025).
- [10] Geethu Miriam Jacob and Bjorn Stenger. 2021. Facial action unit detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7680–7689.
- [11] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
- [12] Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus Maier-Hein, S. M. Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. 2018. A probabilistic U-Net for segmentation of ambiguous images. Advances in Neural Information Processing Systems 31 (2018).
- [13] Guanbin Li, Xin Zhu, Yirui Zeng, Qing Wang, and Liang Lin. 2019. Semantic relationships guided representation learning for facial action unit recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8594–8601.
- [14] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. 2018. EAC-Net: Deep nets with enhancing and cropping for facial action unit detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 11 (2018), 2583–2596.
- [15] Xiaotian Li, Zhihua Li, Huiyuan Yang, Geran Zhao, and Lijun Yin. 2021. Your "attention" deserves attention: A self-diversified multi-channel attention for facial action analysis. In 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). IEEE, 01–08.
- [16] Xiaotian Li, Xiang Zhang, Taoyue Wang, and Lijun Yin. 2023. Knowledge-spreader: Learning semi-supervised facial action dynamics by consistifying knowledge granularity. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 20979–20989.
- [17] Yante Li, Jinsheng Wei, Yang Liu, Janne Kauttonen, and Guoying Zhao. 2022. Deep learning for micro-expression recognition: A survey. IEEE Transactions on Affective Computing 13, 4 (2022), 2028–2046.
- [18] Zhihua Li, Xiang Deng, Xiaotian Li, and Lijun Yin. 2021. Integrating semantic and temporal relationships in facial action unit detection. In Proceedings of the 29th ACM International Conference on Multimedia. 5519–5527.
- [19] Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, and Jingyu Yang. 2024. Multi-scale promoted self-adjusting correlation learning for facial action unit detection. IEEE Transactions on Affective Computing (2024).
- [20] Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
- [21] Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. 2022. Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22).
- [22] S. Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. 2013. DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4, 2 (2013), 151–160.
- [23] Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, and Shiguang Shan. 2019. Local relationship learning with person-specific shape regularization for facial action unit detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11917–11926.
- [24] Yang Qin, Dezhong Peng, Xi Peng, Xu Wang, and Peng Hu. 2022. Deep evidential learning with noisy correspondence for cross-modal retrieval. In Proceedings of the 30th ACM International Conference on Multimedia. 4948–4956.
- [25] Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2021. Asymmetric Loss For Multi-Label Classification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 82–91.
- [26] Murat Sensoy, Lance Kaplan, and Melih Kandemir. 2018. Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems 31 (2018).
- [27] Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li, and Lan-Zhe Guo. 2026. Learning contrastive feature representations for facial action unit detection. Pattern Recognition 173 (2026), 112746.
- [28] Zhiwen Shao, Bikuan Chen, Yong Zhou, Xuehuai Shi, Canlin Li, Lizhuang Ma, and Dit-Yan Yeung. 2026. Constrained and directional ensemble attention for facial action unit detection. Pattern Recognition 169 (2026), 111904.
- [29] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. 2021. JAA-Net: Joint facial action unit detection and face alignment via adaptive attention. International Journal of Computer Vision 129, 2 (2021), 321–340.
- [30] Zhiwen Shao, Zhilei Liu, Jianfei Cai, Yunsheng Wu, and Lizhuang Ma. 2019. Facial action unit detection using attention and relation learning. IEEE Transactions on Affective Computing 13, 3 (2019), 1274–1289.
- [31] Zhiwen Shao, Hancheng Zhu, Yong Zhou, Xiang Xiang, Bing Liu, Rui Yao, and Lizhuang Ma. 2025. Facial action unit detection by adaptively constraining self-attention and causally deconfounding sample. International Journal of Computer Vision 133, 4 (2025), 1711–1726.
- [32] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems 28 (2015).
- [33] Juan Song and Zhilei Liu. 2023. Self-Supervised Facial Action Unit Detection with Region and Relation Learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). IEEE, 1–5.
- [34] Tengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. 2021. Uncertain graph neural networks for facial action unit detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5993–6001.
- [35] Wenyu Song, Shuze Shi, Yu Dong, and Gaoyun An. 2022. Heterogeneous spatio-temporal relation learning network for facial action unit detection. Pattern Recognition Letters 164 (2022), 268–275.
- [36] Yan Tong, Wenhui Liao, and Qiang Ji. 2007. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 10 (2007), 1683–1699.
- [37] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. CoRR (2017).
- [38] Haoran Wang, Weitang Liu, Alex Bocchieri, and Yixuan Li. 2021. Can multi-label classification networks know what they don't know? Advances in Neural Information Processing Systems 34 (2021), 29074–29087.
- [39] Zihan Wang, Siyang Song, Cheng Luo, Songhe Deng, Weicheng Xie, and Linlin Shen. 2024. Multi-scale dynamic and hierarchical relationship modeling for facial action units recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1270–1280.
- [40] Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, and Heikki Kälviäinen. 2025. AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection. In 2025 IEEE International Conference on Multimedia and Expo (ICME). 1–6.
- [41] Jingwei Yan, Jingjing Wang, Qiang Li, Chunmao Wang, and Shiliang Pu. 2022. Weakly supervised regional and temporal learning for facial action unit recognition. IEEE Transactions on Multimedia 25 (2022), 1760–1772.
- [42] Jing Yang, Yordan Hristov, Jie Shen, Yiming Lin, and Maja Pantic. 2023. Toward robust facial action units' detection. Proc. IEEE 111, 10 (2023), 1198–1214.
- [43] Jing Yang, Jie Shen, Yiming Lin, Yordan Hristov, and Maja Pantic. 2023. FAN-Trans: Online knowledge distillation for facial action unit detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6019–6027.
- [44] Xing Zhang, Lijun Yin, Jeffrey F. Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M. Girard. 2014. BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing 32, 10 (2014), 692–706.
- [45] Yong Zhang, Weiming Dong, Bao-Gang Hu, and Qiang Ji. 2018. Classifier learning with prior probabilities for facial action unit recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5108–5116.
- [46] Chen Zhao, Dawei Du, Anthony Hoogs, and Christopher Funk. 2023. Open set action recognition via multi-label evidential learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22982–22991.
- [47] Haoliang Zhou, Shucheng Huang, and Yuqiao Xu. 2025. UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621 (2025), 129261.
- [48] Qing Zhu, Qirong Mao, Jialin Zhang, Xiaohua Huang, and Wenming Zheng. 2025. Towards a robust group-level emotion recognition via uncertainty-aware learning. IEEE Transactions on Affective Computing (2025).