arxiv: 2605.04439 · v1 · submitted 2026-05-06 · 💻 cs.CV

A cross-modal network for facial expression recognition

Pith reviewed 2026-05-08 18:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial expression recognition · cross-modal network · face symmetry · feature fusion · deep neural network · salient refinement · half-face alignment · expression classification

The pith

CMNet recognizes facial expressions by combining symmetric features from whole and half faces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CMNet, a cross-modal network for facial expression recognition that uses face symmetry to learn from a whole face as well as its left and right halves. The goal is to extract complementary features that capture both biological and structural aspects of expressions. A refinement module selects salient information to avoid instability when fusing these features. A separate alignment step ensures that the left and right half-face features correspond properly. The authors report that this design allows CMNet to perform better than prior methods including SCN and LAENet-SA.

Core claim

CMNet exploits face symmetry to learn expression information separately from the whole face and from the left and right half faces, which yields complementary facial features. To prevent negative effects when biological and structural information are fused, a salient facial information refinement module extracts salient expression information and so improves the stability of the resulting classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism aligns the expression information learned from the left and right half faces. Experimental results are reported to show that CMNet outperforms SCN and LAENet-SA for facial expression recognition.

What carries the argument

Cross-modal network (CMNet) with salient facial information refinement module and half-face alignment optimization mechanism that processes whole-face and half-face inputs symmetrically to extract complementary features.
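
The symmetric whole-face and half-face inputs can be illustrated with a small sketch. The paper's exact division procedure is not reproduced on this page (the Figure 2 caption only mentions a central-point division), so the code below is an assumption rather than the authors' method: it cuts a grayscale image at its vertical centre line and mirrors the right half so that both halves share the same orientation. The function name `split_half_faces` and the list-of-rows image type are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]  # rows of grayscale pixel values (hypothetical type)


def split_half_faces(image: Image) -> Tuple[Image, Image]:
    """Split an image at its vertical centre line.

    Returns the left half and the horizontally mirrored right half.
    This is an illustrative assumption, not the division method of the paper.
    """
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all rows must have the same width")

    mid = width // 2
    left = [row[:mid] for row in image]
    # For odd widths the centre column is dropped so both halves have equal size.
    right_mirrored = [row[width - mid:][::-1] for row in image]
    return left, right_mirrored


# Toy 2x4 "image" used only to show the indexing.
face = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
left_half, right_half = split_half_faces(face)
```

Mirroring the right half is one design choice among several; it makes corresponding pixels of the two halves line up, which is convenient if the halves are later compared or aligned.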

Load-bearing premise

That fusing biological and structural information from whole and half faces via the salient facial information refinement module and half-face alignment optimization mechanism does not produce negative effects and instead improves stability and performance of the obtained facial expression classifier.

What would settle it

A direct comparison on standard facial expression benchmarks showing that CMNet does not exceed the accuracy of SCN or LAENet-SA would falsify the outperformance claim.

Figures

Figures reproduced from arXiv: 2605.04439 by Chao Li, Chunwei Tian, Jingyuan Xie, Qi Zhang, Shichao Zhang, Wangmeng Zuo.

Figure 1: The architecture of the proposed CMNet for facial expression recognition.
Figure 2: That is, firstly, a division method of central point
Figure 3: Part facial images with seven emotions from
Figure 4: Part facial images with seven emotions from
Figure 5: Part facial images with eight emotions from AffectNet dataset. For context-sensitive scenes, CAER-S dataset [37] and SFEW 2.0 dataset [41] are used to conduct comparative experiments to test robustness of different methods for emotion recognition in a dynamic and context-sensitive environment in this paper. Specifically, CAER-S dataset was created by selecting static images from video clips in the CAER dat…
Figure 8: The attention visualization result generated by Grad
Figure 9: The accuracy on RAF-DB using different values of
Figure 10: Confusion matrix of our CMNet for cross-database:
Original abstract

Deep neural networks enriched with structural information have been widely employed for facial expression recognition tasks. However, these methods often depend on hierarchical information rather than face property to finish expression recognition. In this paper, we propose a cross-modal network with strong biological and structural information for facial expression recognition (CMNet). CMNet can respectively learn expression information via face symmetry on a whole face, left and right half faces to extract complementary facial features. To prevent negative effect of biological and structural information fusion, a salient facial information refinement module can obtain salient facial expression information to improve stability of an obtained facial expression classifier. To reduce reliance on unilateral facial features, a half-face alignment optimization mechanism is designed to align obtained expression information of learned left and right half faces. Our experimental results demonstrate that CMNet outperforms several novel methods, i.e., SCN and LAENet-SA for facial expression recognition. Codes can be obtained at https://github.com/hellloxiaotian/CMNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes CMNet, a cross-modal network for facial expression recognition that processes whole-face, left-half-face, and right-half-face inputs to exploit symmetry for complementary expression features. It introduces a salient facial information refinement module to extract salient information and avoid negative fusion effects, plus a half-face alignment optimization mechanism to align half-face features and reduce unilateral reliance. The central empirical claim is that CMNet outperforms SCN and LAENet-SA.

Significance. If substantiated by controlled experiments, the incorporation of explicit biological priors (symmetry) and structural fusion mechanisms could offer a practical route to more stable FER models. The public code release aids reproducibility.

major comments (3)
  1. [Abstract] Abstract: the claim that CMNet 'outperforms several novel methods, i.e., SCN and LAENet-SA' is presented without any mention of datasets, training protocols, ablation studies, statistical tests, or error bars, so it is impossible to attribute gains to the proposed modules rather than uncontrolled factors.
  2. [Method] Method section (salient facial information refinement module): the assertion that this module 'can obtain salient facial expression information to improve stability' and 'prevent negative effect of biological and structural information fusion' is load-bearing for the central claim, yet no ablation (full CMNet vs. variant lacking the module) or capacity-matched baseline is reported.
  3. [Method] Method section (half-face alignment optimization mechanism): the claim that the mechanism 'align[s] obtained expression information of learned left and right half faces' and thereby reduces unilateral reliance lacks supporting controlled experiments that would demonstrate it mitigates negative fusion rather than simply adding parameters.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'CMNet can respectively learn expression information via face symmetry on a whole face, left and right half faces' is unclear and should be reworded for precision.
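
The request for error bars in major comment 1 can be made concrete with a short sketch. It summarizes repeated runs of two methods as mean and sample standard deviation, then applies a deliberately crude check of whether the gap between means exceeds the combined spread. The accuracy values are placeholders chosen for illustration, not results from the paper, and the names `summarize` and `gap_exceeds_spread` are hypothetical. A proper analysis would use a significance test suited to the evaluation protocol.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)


def gap_exceeds_spread(a: Sequence[float], b: Sequence[float]) -> bool:
    """Crude check: is |mean(a) - mean(b)| larger than std(a) + std(b)?

    This is a rough heuristic for reading error bars, not a statistical test.
    """
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    return abs(mean_a - mean_b) > std_a + std_b


# Placeholder accuracies over three seeds (NOT values from the paper).
runs_a = [0.90, 0.92, 0.91]
runs_b = [0.85, 0.86, 0.84]

mean_a, std_a = summarize(runs_a)
print(f"method A: {mean_a:.3f} +/- {std_a:.3f}")
print("gap exceeds spread:", gap_exceeds_spread(runs_a, runs_b))
```

Reporting results in this form lets a reader judge whether a claimed improvement is larger than run-to-run variation.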

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the abstract and experimental validation can be strengthened for clarity and rigor. We will revise the manuscript accordingly by expanding the abstract with experimental details and adding targeted ablation studies for the proposed modules.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that CMNet 'outperforms several novel methods, i.e., SCN and LAENet-SA' is presented without any mention of datasets, training protocols, ablation studies, statistical tests, or error bars, so it is impossible to attribute gains to the proposed modules rather than uncontrolled factors.

    Authors: We agree that the abstract should provide more context to support the performance claim. In the revised version, we will update the abstract to explicitly mention the datasets (RAF-DB and FER2013), training protocols (including data augmentation and optimization details), reference to ablation studies in Section 4, and note that results include error bars with statistical significance testing. These details are already present in the experimental section and will now be summarized in the abstract to better attribute improvements to the cross-modal design and modules. revision: yes

  2. Referee: [Method] Method section (salient facial information refinement module): the assertion that this module 'can obtain salient facial expression information to improve stability' and 'prevent negative effect of biological and structural information fusion' is load-bearing for the central claim, yet no ablation (full CMNet vs. variant lacking the module) or capacity-matched baseline is reported.

    Authors: We acknowledge that a direct ablation isolating the salient facial information refinement module would provide stronger evidence. The current manuscript demonstrates overall superiority over SCN and LAENet-SA, but to directly address this point we will add a new ablation study in the revised experiments section: comparing full CMNet against a variant without the refinement module, plus a capacity-matched baseline (e.g., by adjusting channel dimensions to equalize parameters). This will quantify the module's contribution to stability and negative fusion prevention. revision: yes

  3. Referee: [Method] Method section (half-face alignment optimization mechanism): the claim that the mechanism 'align[s] obtained expression information of learned left and right half faces' and thereby reduces unilateral reliance lacks supporting controlled experiments that would demonstrate it mitigates negative fusion rather than simply adding parameters.

    Authors: We agree that controlled experiments are necessary to isolate the effect of the half-face alignment optimization mechanism. While the overall results support reduced unilateral reliance through the cross-modal design, we will add an ablation in the revision: full CMNet versus a variant without the alignment mechanism, including metrics on feature alignment (e.g., cosine similarity between left/right features) and performance under asymmetric conditions. This will demonstrate that the mechanism mitigates negative fusion beyond mere parameter addition. revision: yes
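
The feature-alignment metric named in response 3, cosine similarity between left and right features, can be sketched as follows. The feature vectors are assumed to be plain lists of floats taken from the two half-face branches; how the paper extracts or pools them is not stated on this page, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_alignment(
    left_feats: Sequence[Sequence[float]],
    right_feats: Sequence[Sequence[float]],
) -> float:
    """Average left/right cosine similarity over a batch of samples."""
    if len(left_feats) != len(right_feats) or not left_feats:
        raise ValueError("need equally sized, non-empty batches")
    sims = [cosine_similarity(l, r) for l, r in zip(left_feats, right_feats)]
    return sum(sims) / len(sims)


# Toy batch: the first pair is identical, the second pair is orthogonal.
left_batch = [[1.0, 0.0], [1.0, 0.0]]
right_batch = [[1.0, 0.0], [0.0, 1.0]]
score = mean_alignment(left_batch, right_batch)
```

Comparing this score between the full model and a variant without the alignment mechanism would show whether the mechanism actually brings the two half-face representations closer together.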

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by external comparisons

full rationale

The paper proposes CMNet, a cross-modal network using whole-face and half-face symmetry to extract complementary features, with two custom modules (salient facial information refinement and half-face alignment optimization) to mitigate fusion issues. All load-bearing claims are supported by end-to-end experimental accuracy gains against SCN and LAENet-SA on standard benchmarks. No equations, fitted parameters renamed as predictions, self-citations forming uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation chain. The architecture choices are presented as design decisions justified by biological intuition and then tested empirically, with no reduction of outputs to inputs by construction. This is a standard empirical DL proposal whose validity hinges on reproducible experiments rather than internal self-reference.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard deep-learning assumptions plus domain-specific priors about facial symmetry and the benefit of multi-view fusion; no machine-checked proofs or parameter-free derivations are present. The new modules are architectural inventions whose value is asserted via experiments.

free parameters (1)
  • Module design choices and hyperparameters
    Standard deep network weights and architectural decisions (layer sizes, fusion weights, alignment parameters) are learned or chosen to fit the data.
axioms (2)
  • [domain assumption] Facial expressions are reliably encoded in symmetric and half-face structural information
    Invoked when the network is designed to extract complementary features from whole, left, and right faces.
  • [domain assumption] Fusing multi-view facial features improves classifier stability when properly refined
    Core premise behind the salient refinement and alignment modules.
invented entities (2)
  • Salient facial information refinement module [no independent evidence]
    purpose: To extract salient expression information and prevent negative effects from information fusion
    New module introduced to improve stability of the classifier.
  • Half-face alignment optimization mechanism [no independent evidence]
    purpose: To align learned expression information from left and right half faces and reduce unilateral reliance
    New mechanism proposed to balance half-face contributions.
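
Because layer sizes count among the free parameters, and the rebuttal proposes a capacity-matched baseline obtained by adjusting channel dimensions, a rough parameter-counting sketch may help. It uses the standard count for a 2D convolution (kernel area times input channels times output channels, plus one bias per output channel). The helper names are hypothetical, and the sketch handles a single layer only, not the CMNet architecture.

```python
def conv2d_params(c_in: int, c_out: int, kernel: int = 3, bias: bool = True) -> int:
    """Number of learnable parameters in one 2D convolution layer."""
    if min(c_in, c_out, kernel) <= 0:
        raise ValueError("channel counts and kernel size must be positive")
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def match_width(target_params: int, c_in: int, kernel: int = 3) -> int:
    """Smallest c_out for which a biased conv layer has >= target_params parameters."""
    if target_params <= 0 or c_in <= 0 or kernel <= 0:
        raise ValueError("arguments must be positive")
    per_output_channel = kernel * kernel * c_in + 1  # weights plus one bias
    return -(-target_params // per_output_channel)   # ceiling division


# Example: widen a baseline layer until it matches a given parameter budget.
budget = conv2d_params(c_in=3, c_out=64)
width = match_width(budget, c_in=3)
```

Matching budgets in this way separates the effect of a new module from the effect of simply having more parameters.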

pith-pipeline@v0.9.0 · 5470 in / 1585 out tokens · 47196 ms · 2026-05-08T18:36:35.087556+00:00 · methodology

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 1 internal anchor

  1. [1]

    Expression systems: Editorial overview,

    A. R. Shatzman, “Expression systems: Editorial overview,” Curr . Opin. Biotechnol. , vol. 4, no. 5, pp. 517–519, 1993

  2. [2]

    Predicting personalized image emotion perceptions in social networks,

    S. Zhao, H. Y ao, Y . Gao, G. Ding, and T.-S. Chua, “Predicting personalized image emotion perceptions in social networks,” IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 526–540, 2016

  3. [3]

    To- ward label-efficient emotion and sentiment analysis,

    S. Zhao, X. Hong, J. Y ang, Y . Zhao, and G. Ding, “To- ward label-efficient emotion and sentiment analysis,” Proc. IEEE , vol. 111, no. 10, pp. 1159–1197, 2023

  4. [4]

    Constants across cultures in the face and emotion.,

    P . Ekman and W. V . Friesen, “Constants across cultures in the face and emotion.,” J. Pers. Soc. Psychol., vol. 17, no. 2, p. 124, 1971

  5. [5]

    Attention mechanisms in computer vision: A survey,

    M.-H. Guo et al., “Attention mechanisms in computer vision: A survey,” Comput. Visual Media , vol. 8, no. 3, pp. 331–368, 2022

  6. [6]

    Region attention networks for pose and occlusion ro- bust facial expression recognition,

    K. Wang, X. Peng, J. Y ang, D. Meng, and Y . Qiao, “Region attention networks for pose and occlusion ro- bust facial expression recognition,” IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020

  7. [7]

    Light attention embedding for facial expression recognition,

    C. Wang, J. Xue, K. Lu, and Y . Y an, “Light attention embedding for facial expression recognition,” IEEE Trans. Circuits Syst. Video Technol. , vol. 32, no. 4, pp. 1834–1847, 2021

  8. [8]

    Facial expression recogni- tion in the wild via deep attentive center loss,

    A. H. Farzaneh and X. Qi, “Facial expression recogni- tion in the wild via deep attentive center loss,” in Proc. IEEE Winter Conf. Comput. Vis. Appl. (WACV) , Virtual, Jan. 2021, pp. 2402–2411

  9. [9]

    Learning deep global multi-scale and local attention features for facial ex- pression recognition in the wild,

    Z. Zhao, Q. Liu, and S. Wang, “Learning deep global multi-scale and local attention features for facial ex- pression recognition in the wild,” IEEE Trans. Image Process., vol. 30, pp. 6544–6556, 2021

  10. [10]

    Occlusion aware facial expression recognition using CNN with attention mechanism,

    Y . Li, J. Zeng, S. Shan, and X. Chen, “Occlusion aware facial expression recognition using CNN with attention mechanism,” IEEE Trans. Image Process. , vol. 28, no. 5, pp. 2439–2450, 2018

  11. [11]

    Affective image content analysis: Two decades review and new perspectives,

    S. Zhao et al., “Affective image content analysis: Two decades review and new perspectives,” IEEE Trans. Pat- tern Anal. Mach. Intell. , vol. 44, no. 10, pp. 6729–6751, 2021

  12. [12]

    Coding facial expressions with gabor wavelets,

    M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial expressions with gabor wavelets,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), 1998, pp. 200–205

  13. [13]

    Ica and gabor representation for facial expression recognition,

    I. Buciu, I. Pitas, et al., “Ica and gabor representation for facial expression recognition,” in Proc. Int. Conf. Image Process. (ICIP) , vol. 2, 2003, pp. II–855

  14. [14]

    Sparse representation for accurate classi- fication of corrupted and occluded facial expressions,

    S. F. Cotter, “Sparse representation for accurate classi- fication of corrupted and occluded facial expressions,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2010, pp. 838–841

  15. [15]

    Accurate and robust facial expressions recognition by fusing multiple sparse representation based classifiers,

    Y . Ouyang, N. Sang, and R. Huang, “Accurate and robust facial expressions recognition by fusing multiple sparse representation based classifiers,” Neurocomput- ing, vol. 149, pp. 71–78, 2015

  16. [16]

    Selective transfer machine for personalized facial expression anal- ysis,

    W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for personalized facial expression anal- ysis,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 39, no. 3, pp. 529–545, 2016

  17. [17]

    Two- dimensional discriminant multi-manifolds locality pre- serving projection for facial expression recognition,

    N. Zheng, X. Guo, L. Qi, and L. Guan, “Two- dimensional discriminant multi-manifolds locality pre- serving projection for facial expression recognition,” in Proc. Int. Symp. Circuits Syst. (ISCAS) , 2015, pp. 2065– 2068

  18. [18]

    Facial expression recognition using distance and shape signature features,

    A. Barman and P . Dutta, “Facial expression recognition using distance and shape signature features,” Pattern Recognit. Lett. , vol. 145, pp. 254–261, 2021

  19. [19]

    A comprehensive survey on deep facial expression recognition: Challenges, applications, and future guidelines,

    M. Sajjad et al., “A comprehensive survey on deep facial expression recognition: Challenges, applications, and future guidelines,” Alexandria Eng. J. , vol. 68, pp. 817–840, 2023

  20. [20]

    Adaptive weighting of handcrafted feature losses for facial expression recog- nition,

    W. Xie, L. Shen, and J. Duan, “Adaptive weighting of handcrafted feature losses for facial expression recog- nition,” IEEE Trans. Cybern. , vol. 51, no. 5, pp. 2787– 2800, 2019

  21. [21]

    La-net: Landmark-aware learning for reliable facial expression recognition under label noise,

    Z. Wu and J. Cui, “La-net: Landmark-aware learning for reliable facial expression recognition under label noise,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV) , 2023, pp. 20 698–20 707

  22. [22]

    A perception cnn for facial expression recognition,

    C. Tian, J. Xie, L. Li, W. Zuo, Y . Zhang, and D. Zhang, “A perception cnn for facial expression recognition,” IEEE Trans. Image Process. , vol. 34, pp. 8101–8113, 2025

  23. [23]

    Fa- cial expression recognition through cross-modality at- tention fusion,

    R. Ni, B. Y ang, X. Zhou, A. Cangelosi, and X. Liu, “Fa- cial expression recognition through cross-modality at- tention fusion,” IEEE Trans. Cognit. Dev. Syst. , vol. 15, no. 1, pp. 175–185, 2022

  24. [24]

    Feature decomposition and reconstruction learning for effective facial expression recognition,

    D. Ruan, Y . Y an, S. Lai, Z. Chai, C. Shen, and H. Wang, “Feature decomposition and reconstruction learning for effective facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 7660–7669

  25. [25]

    FERMixNet: An occlusion robust facial expression recognition model with facial mixing augmentation and mid-level representation learning,

    Y . Huang et al., “FERMixNet: An occlusion robust facial expression recognition model with facial mixing augmentation and mid-level representation learning,” IEEE Trans. Affective Comput. , 2024

  26. [26]

    Learning informative and discriminative features for facial expression recognition in the wild,

    Y . Li et al., “Learning informative and discriminative features for facial expression recognition in the wild,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 5, pp. 3178–3189, 2021

  27. [27]

    Cmdvit: A voluntary facial expression recognition model for complex mental disorders,

    J. Y e et al., “Cmdvit: A voluntary facial expression recognition model for complex mental disorders,” IEEE Trans. Image Process. , 2025

  28. [28]

    Co-attentive multi-task convolu- tional neural network for facial expression recognition,

    W. Y u and H. Xu, “Co-attentive multi-task convolu- tional neural network for facial expression recognition,” Pattern Recognit., vol. 123, p. 108 401, 2022

  29. [29]

    JADFER: Exploring spatial-contextual interaction with joint attention dropping for facial ex- 12 pression recognition,

    Y . Gao et al., “JADFER: Exploring spatial-contextual interaction with joint attention dropping for facial ex- 12 pression recognition,” IEEE Trans. Affective Comput. , 2024

  30. [30]

    Mhan: Multi-head hybrid attention net- work for facial expression recognition,

    X. Wang et al., “Mhan: Multi-head hybrid attention net- work for facial expression recognition,” Pattern Recog- nit., vol. 170, p. 112 015, 2026

  31. [31]

    Multi- relations aware network for in-the-wild facial expres- sion recognition,

    D. Chen, G. Wen, H. Li, R. Chen, and C. Li, “Multi- relations aware network for in-the-wild facial expres- sion recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3848–3859, 2023

  32. [32]

    Relation-aware facial expression recognition,

    Y . Xia, H. Y u, X. Wang, M. Jian, and F.-Y . Wang, “Relation-aware facial expression recognition,” IEEE Trans. Cognit. Dev. Syst., vol. 14, no. 3, pp. 1143–1154, 2021

  33. [33]

    Adaptive multilayer perceptual attention network for facial ex- pression recognition,

    H. Liu, H. Cai, Q. Lin, X. Li, and H. Xiao, “Adaptive multilayer perceptual attention network for facial ex- pression recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 9, pp. 6253–6266, 2022

  34. [34]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) , Las V egas, Nevada, USA, Jun. 2016, pp. 770–778

  35. [35]

    CBAM: Convolutional block attention module,

    S. Woo, J. Park, J.-Y . Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in 2018 Proc. Eur . Conf. Comput. Vis. (ECCV) , Berlin, Heidelberg: Springer-V erlag, 2018, pp. 3–19

  36. [36]

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,

    J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing, F. F. Soulié and J. Hérault, Eds., Berlin, Heidelberg, 1990, pp. 227–236, ISBN : 978-3-642-76153-9

  37. [37]

    Context- aware emotion recognition networks,

    J. Lee, S. Kim, S. Kim, J. Park, and K. Sohn, “Context- aware emotion recognition networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV) , Seoul, South Korea, 2019, pp. 10 143–10 152

  38. [38]

    Challenges in representation learning: A report on three machine learning contests,

    I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” Neural Networks , pp. 117–124, 2013

  39. [39]

    Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,

    S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) , Honolulu, Hawaii, USA, Jun. 2017, pp. 2852–2861

  40. [40]

    Af- fectnet: A database for facial expression, valence, and arousal computing in the wild,

    A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Af- fectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Trans. Affective Comput., vol. 10, no. 1, pp. 18–31, 2017

  41. [41]

    Video and image based emotion recogni- tion challenges in the wild: Emotiw 2015,

    A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, “Video and image based emotion recogni- tion challenges in the wild: Emotiw 2015,” in Proc. ACM Int. Conf. Multimodal Interaction ACM ICMI , Seattle, Washington, USA, Nov. 2015, pp. 423–426

  42. [42]

    Acted facial expressions in the wild database,

    A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Acted facial expressions in the wild database,” ANU Tech. Rep. TR-CS-11, vol. 2, no. 1, 2011

  43. [43]

    Ms- celeb-1m: A dataset and benchmark for large-scale face recognition,

    Y . Guo, L. Zhang, Y . Hu, X. He, and J. Gao, “Ms- celeb-1m: A dataset and benchmark for large-scale face recognition,” in Proc. Eur . Conf. Comput. Vis. (ECCV) , Amsterdam, the Netherlands, Oct. 2016, pp. 87–102

  44. [44]

    Retinaface: Single-shot multi-level face localisation in the wild,

    J. Deng, J. Guo, E. V erveras, I. Kotsia, and S. Zafeiriou, “Retinaface: Single-shot multi-level face localisation in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) , Virtual, Jun. 2020, pp. 5203–5212

  45. [45]

    Adam: A Method for Stochastic Optimization

    D. P . Kingma and J. Ba, “Adam: A method for stochas- tic optimization,” arXiv:1412.6980, 2014

  46. [46]

    Distract your attention: Multi-head cross attention network for facial expression recognition,

    Z. Wen, W. Lin, T. Wang, and G. Xu, “Distract your attention: Multi-head cross attention network for facial expression recognition,” Biomimetics, vol. 8, no. 2, p. 199, 2023

  47. [47]

    A stochastic approximation method,

    H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Stat. , pp. 400–407, 1951

  48. [48]

    Gradient-based learning applied to document recog- nition,

    Y . LeCun, L. Bottou, Y . Bengio, and P . Haffner, “Gradient-based learning applied to document recog- nition,” Proc. IEEE , vol. 86, no. 11, pp. 2278–2324, 2002

  49. [49]

    Using the original and symmetrical facetraining samples to perform representation based two-step face recogni- tion,

    Y . Xu, X. Zhu, Z. Li, G. Liu, Y . Lu, and H. Liu, “Using the original and symmetrical facetraining samples to perform representation based two-step face recogni- tion,” Pattern Recognit., vol. 46, no. 4, pp. 1151–1158, 2013

  50. [50]

    Grad-cam++: Generalized gradient- based visual explanations for deep convolutional net- works,

    A. Chattopadhay, A. Sarkar, P . Howlader, and V . N. Balasubramanian, “Grad-cam++: Generalized gradient- based visual explanations for deep convolutional net- works,” in Proc. IEEE Winter Conf. Comput. Vis. Appl. (WACV), Nevada, USA: IEEE, Mar. 2018, pp. 839–847

  51. [51]

    Pose-adaptive hi- erarchical attention network for facial expression recog- nition,

    Y . Liu, J. Peng, J. Zeng, and S. Shan, “Pose-adaptive hi- erarchical attention network for facial expression recog- nition,” arXiv:1905.10059, 2019

  52. [52]

    Robust lightweight facial expression recognition network with label distribution training,

    Z. Zhao, Q. Liu, and F. Zhou, “Robust lightweight facial expression recognition network with label distribution training,” in AAAI Conf. Artif. Intell. , Issue: 4, vol. 35, Virtual, Feb. 2021, pp. 3510–3519

  53. [53]

    Facial expression recognition with inconsistently annotated datasets,

    J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with inconsistently annotated datasets,” in Proc. Eur . Conf. Comput. Vis. (ECCV) , Munich, Ger- many, Sep. 2018, pp. 222–237

  54. [54]

    Sup- pressing uncertainties for large-scale facial expression recognition,

    K. Wang, X. Peng, J. Y ang, S. Lu, and Y . Qiao, “Sup- pressing uncertainties for large-scale facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) , Virtual, Jun. 2020, pp. 6897–6906

  55. [55]

    Cnn-based facial affect anal- ysis on mobile devices,

    C. Hewitt and H. Gunes, “Cnn-based facial affect anal- ysis on mobile devices,” arXiv:1807.08775, 2018

  56. [56]

    FG-AGR: Fine-grained associative graph representa- tion for facial expression recognition in the wild,

    C. Li, X. Li, X. Wang, D. Huang, Z. Liu, and L. Liao, “FG-AGR: Fine-grained associative graph representa- tion for facial expression recognition in the wild,” IEEE Trans. Circuits Syst. Video Technol. , vol. 34, no. 2, pp. 882–896, 2023, Publisher: IEEE

  57. [57]

    Efficient fa- cial feature learning with wide ensemble-based con- volutional neural networks,

    H. Siqueira, S. Magg, and S. Wermter, “Efficient fa- cial feature learning with wide ensemble-based con- volutional neural networks,” in AAAI Conf. Artif. In- tell., Issue: 04, vol. 34, New Y ork, USA, Feb. 2020, pp. 5800–5809

  58. [58]

    FE-SpikeFormer: A camera-based fa- cial expression recognition method for hospital health monitoring,

    Z. Dong et al., “FE-SpikeFormer: A camera-based fa- cial expression recognition method for hospital health monitoring,” IEEE J. Biomed. Health. Inf. , pp. 1–11, 2025. 13

  59. [59]

    Unconstrained facial expression recognition with no- reference de-elements learning,

    H. Li, N. Wang, X. Y ang, X. Wang, and X. Gao, “Unconstrained facial expression recognition with no- reference de-elements learning,” IEEE Trans. Affective Comput., vol. 15, no. 1, pp. 173–185, 2024

  60. [60]

    Learning a facial expression embedding disentangled from identity,

    W. Zhang, X. Ji, K. Chen, Y . Ding, and C. Fan, “Learning a facial expression embedding disentangled from identity,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) , Virtual, Jun. 2021, pp. 6759–6768

  61. [61]

    MMA Trans: Muscle movement aware representation learning for facial expression recognition via transformers,

    H. Liu et al., “MMA Trans: Muscle movement aware representation learning for facial expression recognition via transformers,” IEEE Trans. Ind. Inf. , 2024

  62. [62]

    Learn from all: Erasing attention consistency for noisy label facial expression recognition,

    Y. Zhang, C. Wang, X. Ling, and W. Deng, “Learn from all: Erasing attention consistency for noisy label facial expression recognition,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Tel Aviv, Israel, Oct. 2022, pp. 418–434

  63. [63]

    Face2exp: Combating data biases for facial expression recognition,

    D. Zeng, Z. Lin, X. Yan, Y. Liu, F. Wang, and B. Tang, “Face2exp: Combating data biases for facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), New Orleans, Louisiana, USA, Jun. 2022, pp. 20 291–20 300

  64. [64]

    Adaptively learning facial expression representation via c-f labels and distillation,

    H. Li, N. Wang, X. Ding, X. Yang, and X. Gao, “Adaptively learning facial expression representation via c-f labels and distillation,” IEEE Trans. Image Process., vol. 30, pp. 2016–2028, 2021

  65. [65]

    A novel lightweight facial expression recognition network based on deep shallow network fusion and attention mechanism,

    Q. Yang, Y. He, H. Chen, Y. Wu, and Z. Rao, “A novel lightweight facial expression recognition network based on deep shallow network fusion and attention mechanism,” Algorithms, vol. 18, no. 8, 2025

  66. [66]

    Decoding group emotional dynamics in a web-based collaborative environment: A novel framework utilizing multi-person facial expression recognition,

    Q. Li, Z. Liu, Z. Zhang, Q. Wang, and M. Ma, “Decoding group emotional dynamics in a web-based collaborative environment: A novel framework utilizing multi-person facial expression recognition,” Int. J. Hum.-Comput. Interact., vol. 41, no. 5, pp. 3455–3473, 2025

  67. [67]

    Weighted classification of deep and traditional histogram-based features with kernel representation for robust facial expression recognition,

    M. Najmabadi, M. Masoudifar, and A. Hajipour, “Weighted classification of deep and traditional histogram-based features with kernel representation for robust facial expression recognition,” Appl. Soft Comput., vol. 182, p. 113 630, 2025

  68. [68]

    Facial expression recognition with visual transformers and attentional selective fusion,

    F. Ma, B. Sun, and S. Li, “Facial expression recognition with visual transformers and attentional selective fusion,” IEEE Trans. Affective Comput., vol. 14, no. 2, pp. 1236–1248, 2021

  69. [69]

    Learning vision transformer with squeeze and excitation for facial expression recognition,

    M. Aouayeb, W. Hamidouche, C. Soladie, K. Kpalma, and R. Seguier, “Learning vision transformer with squeeze and excitation for facial expression recognition,” arXiv:2107.03107, 2021

  70. [70]

    Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition,

    J. She, Y. Hu, H. Shi, J. Wang, Q. Shen, and T. Mei, “Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Virtual, Jun. 2021, pp. 6248–6257

  71. [71]

    A novel attention residual network expression recognition method,

    H. Qi, X. Zhang, Y. Shi, and X. Qi, “A novel attention residual network expression recognition method,” IEEE Access, vol. 12, pp. 24 609–24 620, 2024

  72. [72]

    Pose-aware facial expression recognition assisted by expression descriptions,

    S. Wang, Y. Wu, Y. Chang, G. Li, and M. Mao, “Pose-aware facial expression recognition assisted by expression descriptions,” IEEE Trans. Affective Comput., vol. 15, no. 1, pp. 241–253, 2024

  73. [73]

    Human emotion recognition with relational region-level analysis,

    W. Li, X. Dong, and Y. Wang, “Human emotion recognition with relational region-level analysis,” IEEE Trans. Affective Comput., vol. 14, no. 1, pp. 650–663, 2023

  74. [74]

    Label distribution learning on auxiliary label space graphs for facial expression recognition,

    S. Chen, J. Wang, Y. Chen, Z. Shi, X. Geng, and Y. Rui, “Label distribution learning on auxiliary label space graphs for facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Virtual, Jun. 2020, pp. 13 984–13 993

  75. [75]

    Facial expression recognition in the wild using multi-level features and attention mechanisms,

    Y. Li, G. Lu, J. Li, Z. Zhang, and D. Zhang, “Facial expression recognition in the wild using multi-level features and attention mechanisms,” IEEE Trans. Affective Comput., vol. 14, no. 1, pp. 451–462, 2020

  76. [76]

    Searching for mobilenetv3,

    A. Howard et al., “Searching for mobilenetv3,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 1314–1324

Chunwei Tian (Senior Member, IEEE) received the Ph.D. degree from Harbin Institute of Technology, Harbin, China, in 2021. He is currently a Professor with the School of Computer Science and Technology, Harbin Instit...