UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection
Pith reviewed 2026-05-09 22:35 UTC · model grok-4.3
The pith
UAU-Net improves facial action unit detection by explicitly modeling uncertainty in both feature representation and classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UAU-Net is an uncertainty-aware framework for AU detection. At the representation stage, the CV-AFE module uses a conditional VAE to learn AU feature means and variances across scales, conditioned on labels to capture inter-AU dependencies. At the decision stage, the AB-ENN uses Beta distributions to parameterize uncertainty and an asymmetric loss to handle label imbalance, reducing overconfident predictions.
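The CVAE machinery this claim relies on can be made concrete with a minimal sketch of the standard reparameterization and KL terms (Kingma & Welling). This is an illustration of the generic building blocks, not the authors' CV-AFE code; shapes and function names are assumptions.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over feature dims."""
    return 0.5 * np.sum(np.exp(logvar) + np.square(mu) - 1.0 - logvar, axis=-1)
```

In a CVAE, `mu` and `logvar` would additionally be conditioned on the AU label vector; the learned `logvar` is what gives the representation its per-feature uncertainty.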
What carries the argument
CV-AFE, a conditional VAE-based module that learns probabilistic AU representations by estimating means and variances conditioned on labels, and AB-ENN, an asymmetric Beta evidential neural network that models predictive uncertainty with Beta distributions and applies an asymmetric loss for imbalanced multi-label detection.
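The Beta evidential step AB-ENN builds on can be sketched in a few lines, following the standard subjective-logic convention (as in Sensoy et al., adapted to binary labels): non-negative evidence maps to Beta parameters, the Beta mean gives the prediction, and total evidence mass bounds the uncertainty. The function name and the vacuity formula `2 / (a + b)` are the conventional choices, not code from the paper.

```python
import numpy as np

def beta_evidential_predict(pos_evidence, neg_evidence):
    """Map non-negative per-AU evidence to a Beta(a, b) opinion."""
    a = np.asarray(pos_evidence) + 1.0   # Beta alpha = positive evidence + 1
    b = np.asarray(neg_evidence) + 1.0   # Beta beta  = negative evidence + 1
    prob = a / (a + b)                   # Beta mean: expected activation probability
    vacuity = 2.0 / (a + b)              # high when total evidence is low
    return prob, vacuity
```

With zero evidence the prediction is a maximally uncertain 0.5; confident outputs require accumulating evidence, which is what discourages overconfidence on ambiguous inputs.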
If this is right
- Improved robustness to visual noise and subject variations through variance estimation in features.
- More reliable confidence estimates that avoid overconfidence on ambiguous or imbalanced labels.
- Enhanced capture of inter-AU relationships via label conditioning in the representation module.
- Overall higher F1 scores and better calibration metrics on standard AU detection benchmarks.
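"Calibration metrics" here typically means expected calibration error (ECE); a minimal binary-label version, written independently of the paper (binning scheme and bin count are conventional choices, not the authors'):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions: bin by confidence, average |conf - acc| gaps."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)   # confidence of predicted class
    correct = (probs >= 0.5) == (labels == 1)           # prediction matched label
    bins = np.clip(((conf - 0.5) * 2 * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```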
Where Pith is reading between the lines
- Extending this dual uncertainty modeling to other computer vision tasks involving ambiguous labels could yield similar gains in reliability.
- Integration with temporal modeling might further address dynamic AU sequences in video.
- Testing on datasets with different imbalance ratios would help confirm the asymmetric loss's effectiveness beyond BP4D and DISFA.
Load-bearing premise
The introduced CV-AFE and AB-ENN modules effectively capture the heterogeneous AU-specific uncertainties without introducing artifacts or overfitting to the BP4D and DISFA datasets.
What would settle it
Evaluating UAU-Net on a new AU dataset with substantially different visual conditions, label distributions, or noise levels, and finding no improvement in performance or calibration compared to deterministic baselines would falsify the central claim.
original abstract
Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
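The asymmetric loss the abstract refers to follows the formulation popularized by Ridnik et al. for imbalanced multi-label classification; a per-label sketch with their default hyperparameters (the authors' exact variant is not given in the abstract):

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    """Per-label asymmetric focal loss for a predicted probability p."""
    if y == 1:
        # positives: standard focal term, usually with little or no down-weighting
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, 1e-8))
    # negatives: the probability shift zeroes out very easy negatives entirely,
    # and the larger focusing exponent down-weights the remaining easy ones
    p_shift = max(p - clip, 0.0)
    return -(p_shift ** gamma_neg) * math.log(max(1.0 - p_shift, 1e-8))
```

The asymmetry matters because in AU datasets negatives vastly outnumber positives: without it, the sea of easy negatives dominates the gradient and pushes predictions toward confident inactivity.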
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UAU-Net, an uncertainty-aware framework for facial action unit (AU) detection. It introduces CV-AFE, a conditional VAE-based module that learns probabilistic AU representations by estimating means and variances across spatio-temporal scales while conditioning on AU labels to capture inter-AU dependencies and uncertainties from noise/variations. At the decision stage, AB-ENN parameterizes predictive uncertainty via Beta distributions and employs an asymmetric loss to address overconfidence under severe label imbalance. Experiments on BP4D and DISFA are claimed to show strong detection performance with further analyses indicating that joint uncertainty modeling at both stages improves robustness and reliability.
Significance. If the quantitative claims hold, the work has moderate significance for affective computing and computer vision. Explicitly modeling heterogeneous uncertainties in both representation learning and evidential classification addresses a recognized limitation of deterministic AU detectors. The combination of CVAE conditioning and asymmetric Beta evidential networks is a coherent technical contribution that could improve calibration and robustness in imbalanced, ambiguous settings typical of real-world facial analysis.
major comments (3)
- [Abstract] The central claims of 'strong AU detection performance' and 'improved robustness and reliability' are not supported by any numbers, baselines, ablation tables, or statistical tests in the provided abstract; without these, the magnitude and reliability of the gains cannot be assessed.
- [Section 4] Experiments: the claim that uncertainty modeling improves performance rests on results from BP4D and DISFA, yet there is no mention of multiple random seeds, standard deviations, or significance testing (e.g., paired t-tests against baselines); this undermines the assertion that the gains are reproducible rather than artifacts of initialization or dataset-specific tuning.
- [Section 3.2] AB-ENN: the asymmetric loss is presented as essential for mitigating overconfidence on imbalanced binary labels, but no ablation replacing it with a symmetric evidential loss is reported; without this comparison it is unclear whether the asymmetry is load-bearing for the reliability improvements.
minor comments (2)
- [Section 3.1] The description of CV-AFE conditioning on AU labels should explicitly state whether label information is available at inference time or only during training, as this affects practical deployment.
- [Figure 1] Figure 1 (architecture overview) would benefit from clearer annotation of the variance outputs from CV-AFE and the Beta parameters from AB-ENN to help readers trace the uncertainty flow.
Circularity Check
No circularity in the proposed UAU-Net framework
full rationale
The paper introduces two new modules, CV-AFE (a conditional VAE-based feature extractor) and AB-ENN (an asymmetric Beta evidential neural network), as independent architectural contributions for modeling uncertainty at the representation and decision stages. Neither is defined in terms of the target AU detection performance, nor are fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled in via prior work appear in the abstract or the described contributions. Performance claims rest on experiments on the BP4D and DISFA datasets rather than reducing to the inputs by construction. The derivation chain is self-contained, with external empirical validation.
Reference graph
Works this paper leans on
- [1] R. Rashmi Adyapady and B. Annappa. 2023. A comprehensive review of facial expression recognition techniques. Multimedia Systems 29, 1 (2023), 73–103.
- [2] Jiyuan Cao, Zhilei Liu, and Yong Zhang. 2022. Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling. In IJCNN 2022, Padua, Italy. IEEE, 1–8.
- [3] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. 2020. Data uncertainty learning in face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5710–5719.
- [4] Yanan Chang, Caichao Zhang, Yi Wu, and Shangfei Wang. 2024. Facial Action Unit Recognition Enhanced by Text Descriptions of FACS. IEEE Transactions on Affective Computing (2024).
- [5] Lei Chen, Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022. Uncertainty-Aware Representation Learning for Action Segmentation. In IJCAI, Vol. 2. 6.
- [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
- [7] Paul Ekman and Wallace V. Friesen. 1978. Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978).
- [8] Zongbo Han, Changqing Zhang, Huazhu Fu, and Joey Tianyi Zhou. 2022. Trusted multi-view classification with dynamic evidential fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2 (2022), 2551–2566.
- [9] Wenchong He, Zhe Jiang, Tingsong Xiao, Zelin Xu, and Yukun Li. 2025. A survey on uncertainty quantification methods for deep learning. Comput. Surveys (2025).
- [10] Geethu Miriam Jacob and Bjorn Stenger. 2021. Facial action unit detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7680–7689.
- [11] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
- [12] Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus Maier-Hein, S. M. Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. 2018. A probabilistic U-Net for segmentation of ambiguous images. Advances in Neural Information Processing Systems 31 (2018).
- [13] Guanbin Li, Xin Zhu, Yirui Zeng, Qing Wang, and Liang Lin. 2019. Semantic relationships guided representation learning for facial action unit recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8594–8601.
- [14] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. 2018. EAC-Net: Deep nets with enhancing and cropping for facial action unit detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 11 (2018), 2583–2596.
- [15] Xiaotian Li, Zhihua Li, Huiyuan Yang, Geran Zhao, and Lijun Yin. 2021. Your "attention" deserves attention: A self-diversified multi-channel attention for facial action analysis. In 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). IEEE, 01–08.
- [16] Xiaotian Li, Xiang Zhang, Taoyue Wang, and Lijun Yin. 2023. Knowledge-spreader: Learning semi-supervised facial action dynamics by consistifying knowledge granularity. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 20979–20989.
- [17] Yante Li, Jinsheng Wei, Yang Liu, Janne Kauttonen, and Guoying Zhao. 2022. Deep learning for micro-expression recognition: A survey. IEEE Transactions on Affective Computing 13, 4 (2022), 2028–2046.
- [18] Zhihua Li, Xiang Deng, Xiaotian Li, and Lijun Yin. 2021. Integrating semantic and temporal relationships in facial action unit detection. In Proceedings of the 29th ACM International Conference on Multimedia. 5519–5527.
- [19] Xin Liu, Kaishen Yuan, Xuesong Niu, Jingang Shi, Zitong Yu, Huanjing Yue, and Jingyu Yang. 2024. Multi-scale promoted self-adjusting correlation learning for facial action unit detection. IEEE Transactions on Affective Computing (2024).
- [20] Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
- [21] Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. 2022. Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22).
- [22] S. Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. 2013. DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4, 2 (2013), 151–160.
- [23] Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, and Shiguang Shan. 2019. Local relationship learning with person-specific shape regularization for facial action unit detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11917–11926.
- [24] Yang Qin, Dezhong Peng, Xi Peng, Xu Wang, and Peng Hu. 2022. Deep evidential learning with noisy correspondence for cross-modal retrieval. In Proceedings of the 30th ACM International Conference on Multimedia. 4948–4956.
- [25] Tal Ridnik, Emanuel Ben-Baruch, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and Lihi Zelnik-Manor. 2021. Asymmetric Loss For Multi-Label Classification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 82–91.
- [26] Murat Sensoy, Lance Kaplan, and Melih Kandemir. 2018. Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems 31 (2018).
- [27] Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li, and Lan-Zhe Guo. 2026. Learning contrastive feature representations for facial action unit detection. Pattern Recognition 173 (2026), 112746.
- [28] Zhiwen Shao, Bikuan Chen, Yong Zhou, Xuehuai Shi, Canlin Li, Lizhuang Ma, and Dit-Yan Yeung. 2026. Constrained and directional ensemble attention for facial action unit detection. Pattern Recognition 169 (2026), 111904.
- [29] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. 2021. JAA-Net: Joint facial action unit detection and face alignment via adaptive attention. International Journal of Computer Vision 129, 2 (2021), 321–340.
- [30] Zhiwen Shao, Zhilei Liu, Jianfei Cai, Yunsheng Wu, and Lizhuang Ma. 2019. Facial action unit detection using attention and relation learning. IEEE Transactions on Affective Computing 13, 3 (2019), 1274–1289.
- [31] Zhiwen Shao, Hancheng Zhu, Yong Zhou, Xiang Xiang, Bing Liu, Rui Yao, and Lizhuang Ma. 2025. Facial action unit detection by adaptively constraining self-attention and causally deconfounding sample. International Journal of Computer Vision 133, 4 (2025), 1711–1726.
- [32] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems 28 (2015).
- [33] Juan Song and Zhilei Liu. 2023. Self-Supervised Facial Action Unit Detection with Region and Relation Learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023). IEEE, 1–5.
- [34] Tengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. 2021. Uncertain graph neural networks for facial action unit detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5993–6001.
- [35] Wenyu Song, Shuze Shi, Yu Dong, and Gaoyun An. 2022. Heterogeneous spatio-temporal relation learning network for facial action unit detection. Pattern Recognition Letters 164 (2022), 268–275.
- [36] Yan Tong, Wenhui Liao, and Qiang Ji. 2007. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 10 (2007), 1683–1699.
- [37] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. CoRR (2017).
- [38] Haoran Wang, Weitang Liu, Alex Bocchieri, and Yixuan Li. 2021. Can multi-label classification networks know what they don't know? Advances in Neural Information Processing Systems 34 (2021), 29074–29087.
- [39] Zihan Wang, Siyang Song, Cheng Luo, Songhe Deng, Weicheng Xie, and Linlin Shen. 2024. Multi-scale dynamic and hierarchical relationship modeling for facial action units recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1270–1280.
- [40] Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, and Heikki Kälviäinen. 2025. AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection. In 2025 IEEE International Conference on Multimedia and Expo (ICME). 1–6.
- [41] Jingwei Yan, Jingjing Wang, Qiang Li, Chunmao Wang, and Shiliang Pu. 2022. Weakly supervised regional and temporal learning for facial action unit recognition. IEEE Transactions on Multimedia 25 (2022), 1760–1772.
- [42] Jing Yang, Yordan Hristov, Jie Shen, Yiming Lin, and Maja Pantic. 2023. Toward robust facial action units' detection. Proc. IEEE 111, 10 (2023), 1198–1214.
- [43] Jing Yang, Jie Shen, Yiming Lin, Yordan Hristov, and Maja Pantic. 2023. FAN-Trans: Online knowledge distillation for facial action unit detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6019–6027.
- [44] Xing Zhang, Lijun Yin, Jeffrey F. Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M. Girard. 2014. BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing 32, 10 (2014), 692–706.
- [45] Yong Zhang, Weiming Dong, Bao-Gang Hu, and Qiang Ji. 2018. Classifier learning with prior probabilities for facial action unit recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5108–5116.
- [46] Chen Zhao, Dawei Du, Anthony Hoogs, and Christopher Funk. 2023. Open set action recognition via multi-label evidential learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22982–22991.
- [47] Haoliang Zhou, Shucheng Huang, and Yuqiao Xu. 2025. UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621 (2025), 129261.
- [48] Qing Zhu, Qirong Mao, Jialin Zhang, Xiaohua Huang, and Wenming Zheng. 2025. Towards a robust group-level emotion recognition via uncertainty-aware learning. IEEE Transactions on Affective Computing (2025).