arxiv: 2606.22725 · v1 · pith:PATSDHC7new · submitted 2026-06-21 · 💻 cs.CV · cs.AI

Interpretable Uncertainty Routing Separating Emotion Ambiguity from Distribution Shift in Facial Expression Recognition

Pith reviewed 2026-06-26 10:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords facial expression recognition · uncertainty decomposition · aleatoric uncertainty · epistemic uncertainty · annotator disagreement · distribution shift · routing mechanism

The pith

Uncertainty decomposition separates emotion ambiguity from distribution shift for differentiated routing in facial expression recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Facial expression recognition must handle two distinct problems that call for different responses: faces where human annotators disagree on the expressed emotion, and inputs that fall outside the training distribution. A single uncertainty score mixes the two, but decomposing uncertainty into aleatoric and epistemic components lets the system report ambiguity on the first while rejecting the second. The paper obtains both components from a deep ensemble of fine-tuned models and checks each against an external signal, finding that aleatoric uncertainty aligns with annotator disagreement while epistemic uncertainty flags corrupted images. This split powers an inference-time routing method that keeps substantially more ambiguous but in-distribution faces than a single-uncertainty baseline while maintaining the same out-of-distribution rejection rate. The advantage is shown to come specifically from the ability to choose different actions rather than from uncertainty measurement alone.

Core claim

Uncertainty-Aware Routing exploits the separation of aleatoric uncertainty, which recovers human annotator disagreement at Spearman correlation 0.66, from epistemic uncertainty, which detects corruption-induced distribution shift at average AUROC 0.699. The routing mechanism therefore reports ambiguity for in-distribution faces and rejects out-of-distribution inputs, retaining approximately 1.8 times more ambiguous faces than single-uncertainty routing at a matched out-of-distribution rejection rate. A label-distribution-learning baseline recovers disagreement comparably yet cannot perform the differentiated routing because it lacks the separation.
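
The matched-rate comparison in this claim can be illustrated with a small sketch. The code below is not the paper's evaluation code: the function names, the synthetic scores, and the choice of total uncertainty (aleatoric plus epistemic) as the single-score baseline are all assumptions made for illustration. It fixes each router's rejection threshold so that the same fraction of out-of-distribution inputs is rejected, then measures how many ambiguous in-distribution faces survive.

```python
import numpy as np

def threshold_for_rejection_rate(ood_scores, rate):
    """Threshold such that roughly `rate` of the OOD scores lie above it."""
    return float(np.quantile(ood_scores, 1.0 - rate))

def retained_fraction(scores, threshold):
    """Fraction of inputs that are NOT rejected (score <= threshold)."""
    return float(np.mean(np.asarray(scores) <= threshold))

# Synthetic scores, invented for illustration only (not data from the paper).
rng = np.random.default_rng(0)
n = 1000
ale_ambiguous = rng.uniform(0.8, 1.2, n)  # ambiguous in-distribution faces: high aleatoric
epi_ambiguous = rng.uniform(0.0, 0.1, n)  # ... but low epistemic
ale_ood = rng.uniform(0.2, 0.6, n)        # corrupted inputs: moderate aleatoric
epi_ood = rng.uniform(0.3, 0.6, n)        # ... and high epistemic

target_rate = 0.9  # both routers must reject about 90% of OOD inputs

# Single-score router (assumed baseline): reject on total uncertainty.
tau_single = threshold_for_rejection_rate(ale_ood + epi_ood, target_rate)
kept_single = retained_fraction(ale_ambiguous + epi_ambiguous, tau_single)

# Decomposed router: reject on epistemic uncertainty only.
tau_epi = threshold_for_rejection_rate(epi_ood, target_rate)
kept_decomposed = retained_fraction(epi_ambiguous, tau_epi)

print(kept_single, kept_decomposed)
```

On this toy data the decomposed router keeps every ambiguous face and the single-score router keeps none. That gap is a consequence of how the synthetic scores were drawn and says nothing about the 1.8 times figure reported in the paper.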

What carries the argument

Uncertainty-Aware Routing (UAR), an inference-time mechanism that applies separate thresholds to aleatoric uncertainty for reporting ambiguity and to epistemic uncertainty for rejection.
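
A minimal sketch of such a two-threshold router is shown below. The action names, the threshold arguments `tau_ale` and `tau_epi`, and the rule that rejection takes precedence when both thresholds are exceeded are assumptions for illustration; the summary above only states that the two thresholds are applied separately.

```python
import numpy as np

# Hypothetical action labels; the paper's own naming may differ.
PREDICT = "predict"
REPORT_AMBIGUITY = "report_ambiguity"
REJECT = "reject"

def route(h_ale, h_epi, tau_ale, tau_epi):
    """Assign one of three actions per input from two independent thresholds."""
    h_ale = np.asarray(h_ale, dtype=float)
    h_epi = np.asarray(h_epi, dtype=float)
    actions = np.full(h_ale.shape, PREDICT, dtype=object)
    # High aleatoric uncertainty: keep the input but surface its ambiguity.
    actions[h_ale > tau_ale] = REPORT_AMBIGUITY
    # High epistemic uncertainty: reject. Applied last, so it overrides
    # the ambiguity action (an assumed precedence rule).
    actions[h_epi > tau_epi] = REJECT
    return actions

example = route(
    h_ale=[0.1, 0.9, 0.9, 0.1],
    h_epi=[0.1, 0.1, 0.9, 0.9],
    tau_ale=0.5,
    tau_epi=0.5,
)
print(example.tolist())
```

Keeping the two comparisons independent is what lets an ambiguous but in-distribution face reach the "report" branch instead of being discarded with the shifted inputs.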

If this is right

  • Ambiguous in-distribution faces can be surfaced with their disagreement level instead of being discarded.
  • Out-of-distribution inputs can be rejected without also discarding valid but ambiguous cases.
  • Label distribution learning recovers annotator disagreement but supplies no mechanism for choosing different actions on shift.
  • The separation enables interpretable selection between reporting and rejection at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could support routing in other label-ambiguous tasks such as medical image diagnosis where both disagreement and domain shift appear.
  • Real-world deployment could route low-ambiguity in-distribution cases to automated output while routing high-ambiguity or shifted cases to human review.
  • Extending the validation beyond synthetic corruptions to natural domain shifts would test whether epistemic uncertainty remains a reliable shift detector.

Load-bearing premise

That the aleatoric component extracted from the ensemble is a faithful proxy for human annotator disagreement and the epistemic component is a faithful proxy for distribution shift induced by image corruptions.
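
For readers unfamiliar with the decomposition, one common way to split ensemble uncertainty is entropy-based: total predictive entropy equals the expected per-member entropy (often read as aleatoric) plus the mutual information between the prediction and the ensemble member (often read as epistemic). The sketch below assumes this formulation. The summary here does not confirm which formulas the paper uses, and the function name and array layout are hypothetical.

```python
import numpy as np

def decompose_uncertainty(probs, eps=1e-12):
    """Entropy-based split of ensemble uncertainty.

    probs: array of shape (M, N, K) holding softmax outputs for
           M ensemble members, N inputs, and K classes.
    Returns (aleatoric, epistemic, total), each of shape (N,), in nats.
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)                                # (N, K)
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)  # entropy of the mean
    member_entropy = -(probs * np.log(probs + eps)).sum(axis=-1)   # (M, N)
    aleatoric = member_entropy.mean(axis=0)                        # expected entropy
    epistemic = total - aleatoric                                  # mutual information
    return aleatoric, epistemic, total

# Two members, two inputs, two classes.
# Input 0: both members say 50/50 (ambiguous, members agree).
# Input 1: members are confident but contradict each other.
toy = np.array([
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0]],
])
ale, epi, tot = decompose_uncertainty(toy)
print(ale, epi)
```

In the toy case the first input carries only aleatoric uncertainty and the second only epistemic uncertainty, which is the kind of separation the premise above depends on.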

What would settle it

A direct comparison of UAR routing decisions against human judgments on whether each face should be reported with its ambiguity or rejected outright.

Figures

Figures reproduced from arXiv: 2606.22725 by Keito Inoshita, Takato Ueno.

Figure 1
Figure 1: Overall pipeline of the proposed dual-validated uncertainty decomposition and routing framework. Selective prediction allows a model to abstain on uncertain inputs, reducing risk at the cost of coverage [6]. Confidence estimation based on failure prediction [2] improves abstention criteria, and extensions to learning-to-defer [21] enable routing to multiple specialists. In subjective tasks such as emotio…
Figure 2
Figure 2: UAR routing mechanism: each input is assigned to one of three actions based on independent thresholds on H_epi and H_ale. … the top-τ percentile are treated as positive examples of high disagreement, and the area under the ROC curve (AUROC) of H_ale for this binary classification task, together with the Spearman correlation between H_ale and d, measures how faithfully aleatoric uncertainty tracks annotator disag…
Figure 3
Figure 3: Scatter of decomposed uncertainties on clean FERPlus test images and OOD inputs, coloured by annotator disagreement; red dash-dot line: single-scalar threshold. Evaluation metrics. Accuracy, expected calibration error (ECE), Jensen–Shannon divergence to the human voting distribution, AUROC, and Spearman correlation are reported. All key values carry 95% confidence intervals from 2,000 bootstrap iterations…
Figure 4
Figure 4: Dual-validation results: panel (a) H_ale deciles vs. mean annotator disagreement; panel (b) OOD detection AUROC by corruption severity and type. … for aleatoric and 0.861 for epistemic, with the gap robust across positive-example thresholds from the top 20% to 50% of voting entropy (0.931 to 0.836). LDL achieves comparable recovery (ρ = 0.671, ADD AUROC 0.910); all ensemble members are trained with hard labe…
Figure 5
Figure 5: Routing performance comparison: panel (a) aggregate routing AUC across all methods; panel (b) per-corruption routing AUC for decomposed epistemic versus single maximum probability. … temperature calibration is monotone and preserves rankings, epistemic detection remains superior to the temperature-calibrated baseline. Paired bootstrap tests at the highest severity confirm significant advantages for Gaussian noise…
Original abstract

Facial expression recognition (FER) is inherently ambiguous: human annotators frequently disagree, and models deployed in real environments face distribution shift. Crucially, these two conditions demand different downstream actions, as ambiguous in-distribution faces should be reported with their ambiguity whereas out-of-distribution inputs should be rejected. However, a single uncertainty score conflates the two. In this study, uncertainty decomposition into aleatoric and epistemic components for FER is investigated, and Uncertainty-Aware Routing (UAR), an inference-time routing mechanism that exploits the separation, is introduced. Specifically, aleatoric and epistemic uncertainties are obtained from a Deep Ensemble of fully fine-tuned DINOv2 models and are each validated against an independent external signal: aleatoric against human annotator disagreement, and epistemic against distribution shift induced by image corruptions. The proposed dual-validation protocol reveals that aleatoric recovers annotator disagreement with Spearman correlation 0.66 (95% CI: 0.64-0.68), and epistemic detects corruption-induced shifts, achieving average AUROC of 0.699 at the highest corruption severity. UAR retains approximately 1.8 times more ambiguous in-distribution faces than single-uncertainty routing at a matched out-of-distribution rejection rate. A strong label-distribution-learning baseline achieves comparable disagreement recovery but cannot separate ambiguity from shift and therefore cannot route, establishing that the value of decomposition lies in the separation enabling interpretable and differentiated action selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that uncertainty in facial expression recognition can be decomposed into aleatoric and epistemic components using a Deep Ensemble of fine-tuned DINOv2 models. Aleatoric uncertainty is validated against human annotator disagreement (Spearman 0.66), epistemic against corruption-induced distribution shifts (average AUROC 0.699), and the resulting Uncertainty-Aware Routing (UAR) retains ~1.8x more ambiguous in-distribution faces than single-uncertainty routing at matched OOD rejection rates, while a label-distribution-learning baseline cannot separate the signals for routing.

Significance. If the components are specific to their target signals, the work provides a practical mechanism for differentiated actions in FER deployment (report ambiguity vs. reject shift). The dual-validation against independent external signals and the quantitative retention gain are concrete strengths that would support the value of decomposition over conflated uncertainty.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (dual-validation protocol): Spearman 0.66 for aleatoric vs. annotator disagreement and AUROC 0.699 for epistemic vs. corruptions are reported, but no cross-sensitivity results are given (e.g., does aleatoric rise under corruptions; does epistemic rise with annotator disagreement). This test is load-bearing for the claim that the decomposition enables clean, interpretable routing separation.
  2. [§4.3] §4.3 (UAR evaluation): the 1.8x retention advantage at matched OOD rejection rate is attributed to the interpretable decomposition, yet without the cross-sensitivity evidence the gain cannot be unambiguously credited to separation rather than possible entanglement of the two uncertainty estimates.
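
The cross-sensitivity test requested in these comments could be sketched as follows. This is an illustrative outline, not the authors' protocol: the function names and argument layout are hypothetical, Spearman correlation is computed as Pearson correlation on average ranks, and AUROC uses the rank-sum (Mann-Whitney) identity. The toy numbers are invented.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)  # highest rank inside each tie group
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def auroc(scores, labels):
    """AUROC via the rank-sum identity; labels are 1 (positive) or 0."""
    labels = np.asarray(labels).astype(bool)
    ranks = average_ranks(scores)
    n_pos, n_neg = int(labels.sum()), int((~labels).sum())
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

def cross_sensitivity(ale_clean, epi_clean, disagreement, ale_corrupt, epi_corrupt):
    """Intended pairings plus the two cross pairings the referee asks about."""
    shift = np.r_[np.zeros(len(ale_clean)), np.ones(len(ale_corrupt))]
    return {
        "ale_vs_disagreement": spearman(ale_clean, disagreement),       # intended
        "epi_vs_disagreement": spearman(epi_clean, disagreement),       # cross
        "epi_vs_shift": auroc(np.r_[epi_clean, epi_corrupt], shift),    # intended
        "ale_vs_shift": auroc(np.r_[ale_clean, ale_corrupt], shift),    # cross
    }

# Invented toy values, for illustration only.
result = cross_sensitivity(
    ale_clean=[0.1, 0.2, 0.3, 0.4],
    epi_clean=[0.3, 0.1, 0.4, 0.2],
    disagreement=[0, 1, 2, 3],
    ale_corrupt=[0.15, 0.25, 0.35, 0.45],
    epi_corrupt=[0.9, 0.8, 0.7, 0.6],
)
print(result)
```

A clean separation would show the cross Spearman correlation near 0 and the cross AUROC near 0.5, while the intended pairings stay high.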
minor comments (2)
  1. [Methods] Methods section: the number of ensemble members and the precise formulas used to extract aleatoric (e.g., expected entropy) and epistemic (e.g., mutual information) uncertainties from the DINOv2 ensemble predictions are not stated.
  2. [Abstract] Abstract: the 95% CI (0.64-0.68) on the Spearman correlation is given without the underlying sample size or computation method.
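
One plausible way to obtain such an interval is a percentile bootstrap over test images, sketched below. The Figure 3 excerpt mentions 2,000 bootstrap iterations, but the percentile method, the resampling unit, the seed, the synthetic data, and all function names here are assumptions.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def bootstrap_spearman_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for Spearman's rho."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample image indices with replacement
        stats[b] = spearman(x[idx], y[idx])
    lower, upper = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return spearman(x, y), float(lower), float(upper)

# Synthetic stand-ins for aleatoric uncertainty and annotator disagreement.
rng = np.random.default_rng(1)
h_ale = rng.normal(size=300)
disagreement = h_ale + rng.normal(size=300)
rho, ci_low, ci_high = bootstrap_spearman_ci(h_ale, disagreement)
print(rho, ci_low, ci_high)
```

Reporting the number of resampled images alongside the interval would answer the sample-size half of the comment.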

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that cross-sensitivity tests would strengthen the evidence for clean separation of the uncertainty components. We address each point below and will revise the manuscript to incorporate the requested analyses.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (dual-validation protocol): Spearman 0.66 for aleatoric vs. annotator disagreement and AUROC 0.699 for epistemic vs. corruptions are reported, but no cross-sensitivity results are given (e.g., does aleatoric rise under corruptions; does epistemic rise with annotator disagreement). This test is load-bearing for the claim that the decomposition enables clean, interpretable routing separation.

    Authors: We agree that the absence of cross-sensitivity results leaves open the possibility of entanglement. Our dual-validation protocol uses independent external signals, but we did not explicitly test whether aleatoric uncertainty increases under corruptions or whether epistemic uncertainty correlates with annotator disagreement. We will compute and report these cross-sensitivity results (including quantitative measures and visualizations) in the revised §4 and abstract to directly address this concern. revision: yes

  2. Referee: [§4.3] §4.3 (UAR evaluation): the 1.8x retention advantage at matched OOD rejection rate is attributed to the interpretable decomposition, yet without the cross-sensitivity evidence the gain cannot be unambiguously credited to separation rather than possible entanglement of the two uncertainty estimates.

    Authors: The 1.8x retention gain is measured using the separated uncertainties for differentiated routing actions. We acknowledge that without cross-sensitivity evidence it is not possible to fully rule out entanglement as an alternative explanation for the observed advantage. We will add the cross-sensitivity results and revise the discussion and attribution in §4.3 to reflect the new evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; external validations independent of routing rule

full rationale

The paper obtains aleatoric and epistemic uncertainties from a Deep Ensemble of fine-tuned DINOv2 models using standard decomposition. These are validated against independent external signals (human annotator disagreement via Spearman correlation, corruption-induced shifts via AUROC), which are not derived from the same fitted parameters or routing rule. The UAR retention advantage (1.8x) is reported as an empirical comparison at matched rejection rates against single-uncertainty routing and a label-distribution-learning baseline. No equations or claims reduce by construction to inputs; no self-citations are invoked as load-bearing uniqueness theorems; the separation enabling differentiated actions is measured against quantities outside the model (annotator labels, synthetic corruptions). This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard ensemble variance for epistemic uncertainty and on the assumption that model disagreement on corrupted images proxies real distribution shift; no new entities are postulated and no free parameters are explicitly fitted beyond standard training choices.

axioms (1)
  • domain assumption: Deep ensemble disagreement separates aleatoric from epistemic uncertainty in the manner required for the routing rule.
    Invoked when the paper states that aleatoric and epistemic components are obtained from the ensemble and each validated separately.

pith-pipeline@v0.9.1-grok · 5793 in / 1468 out tokens · 42376 ms · 2026-06-26T10:19:29.195755+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 22 canonical work pages · 1 internal anchor

  1. [1]

    Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction. pp. 279–283 (2016). https://doi.org/10.1145/2993148.2993165

  2. [2]

    Corbière, C., Thome, N., Bar-Hen, A., Cord, M., Pérez, P.: Addressing failure prediction by learning model confidence. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. pp. 2902–2913. No. 261 (2019). https://doi.org/10.5555/3454287.3454548

  3. [3]

    Depeweg, S., Hernández-Lobato, J.M., Doshi-Velez, F., Udluft, S.: Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In: Proceedings of the 35th International Conference on Machine Learning. pp. 1184–1193 (2018)

  4. [4]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16×16 words: Transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations (2021)

  5. [5]

    Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: Proceedings of The 33rd International Conference on Machine Learning. pp. 1050–1059 (2016)

  6. [6]

    Geifman, Y., El-Yaniv, R.: Selective classification for deep neural networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 4885–4894 (2017). https://doi.org/10.5555/3295222.3295241

  7. [7]

    Geng, X.: Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28(7), 1734–1748 (2016). https://doi.org/10.1109/TKDE.2016.2545658

  8. [8]

    Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1321–1330 (2017)

  9. [9]

    Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the 7th International Conference on Learning Representations (2019)

  10. [10]

    Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Proceedings of the 5th International Conference on Learning Representations (2017)

  11. [11]

    Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning 110, 457–506 (2021). https://doi.org/10.1007/s10994-021-05946-3

  12. [12]

    Inoshita, K.: Bridging the silos in affective AI: A critical perspective from data to society (2026). https://doi.org/10.2139/ssrn.6774479, SSRN

  13. [13]

    Inoshita, K., Ueno, T.: Bayesian spectral emotion transition discovery from multi-annotator disagreement. arXiv (2026). https://doi.org/10.48550/arXiv.2606.01906

  14. [14]

    Inoshita, K., Ueno, T.: Uncertainty decomposition via cyclical SG-MCMC and soft-label learning for subjective NLP. arXiv (2026). https://doi.org/10.48550/arXiv.2605.24773

  15. [15]

    Inoshita, K., Zhou, X., Kawai, A., Yada, K.: LLMs capture emotion labels, not emotion uncertainty: Distributional analysis and calibration of human-LLM judgment gaps. arXiv (2026). https://doi.org/10.48550/arXiv.2604.27345

  16. [16]

    Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 5580–5590 (2017). https://doi.org/10.5555/3295222.3295309

  17. [17]

    Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 6405–6416 (2017). https://doi.org/10.5555/3295222.3295387

  18. [18]

    Le, N., Nguyen, K., Tran, Q., Tjiputra, E., Le, B., Nguyen, A.: Uncertainty-aware label distribution learning for facial expression recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6088–6097 (2023). https://doi.org/10.1109/WACV56688.2023.00603

  19. [19]

    Lee, J., Choi, Y., Kim, H., Kim, I.J., Nam, G.P.: Navigating label ambiguity for facial expression recognition in the wild. vol. 39, pp. 4517–4525 (2025). https://doi.org/10.1609/aaai.v39i4.32476

  20. [20]

    Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2852–2861 (2017). https://doi.org/10.1109/CVPR.2017.277

  21. [21]

    Mao, A., Mohri, C., Mohri, M., Zhong, Y.: Two-stage learning to defer with multiple experts. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. pp. 3578–3606. No. 159 (2023). https://doi.org/10.5555/3666122.3666281

  22. [22]

    Mao, J., Xu, R., Yin, X., Chang, Y., Nie, B., Huang, A., Wang, Y.: POSTER++: A simpler and stronger facial expression recognition network. 157(C) (2025). https://doi.org/10.1016/j.patcog.2024.110951

  23. [23]

    Transactions on Machine Learning Research (2024), arXiv:2304.07193

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual fe...

  24. [24]

    Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., Snoek, J.: Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. pp. 14003–14014. No. 1254 (2019). https://doi.org/10.5555/3454...

  25. [25]

    Plank, B.: The “Problem” of human label variation: On ground truth in data, modeling and evaluation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 10671–10682 (2022). https://doi.org/10.18653/v1/2022.emnlp-main.731

  26. [26]

    She, J., Hu, Y., Shi, H., Wang, J., Shen, Q., Mei, T.: Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6248–6257 (2021). https://doi.org/10.1109/CVPR46437.2021.00618

  27. [27]

    Uma, A., Fornaciari, T., Hovy, D., Paun, S., Plank, B., Poesio, M.: Learning from disagreement: A survey. Journal of Artificial Intelligence Research 72, 1385–1470 (2021). https://doi.org/10.1613/jair.1.12752

  28. [28]

    Wang, K., Peng, X., Yang, J., Lu, S., Qiao, Y.: Suppressing uncertainties for large-scale facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6897–6906 (2020). https://doi.org/10.1109/CVPR42600.2020.00693

  29. [29]

    Wu, Z., Cui, J.: LA-Net: Landmark-aware learning for reliable facial expression recognition under label noise. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20698–20707 (2023). https://doi.org/10.1109/ICCV51070.2023.01892

  30. [30]

    Zhang, Y., Wang, C., Deng, W.: Relative uncertainty learning for facial expression recognition. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. pp. 17616–17627. No. 1348 (2021)

  31. [31]

    Zhang, Y., Wang, C., Ling, X., Deng, W.: Learn from all: Erasing attention consistency for noisy label facial expression recognition. In: Computer Vision – ECCV. pp. 418–434 (2022). https://doi.org/10.1007/978-3-031-19809-0_24

  33. [33]

    Zhang, Z., Zhao, P., Park, E., Yang, J.: MART: Masked affective RepresenTation learning via masked temporal distribution distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12830–12840 (2024). https://doi.org/10.1109/CVPR52733.2024.01219

  34. [34]

    Zhou, H., Huang, S., Xu, Y.: UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621, 129261 (2025). https://doi.org/10.1016/j.neucom.2024.129261