pith. machine review for the scientific record.

arxiv: 2604.10541 · v2 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial action units · facial expression recognition · bidirectional learning · semantic mapping · heterogeneous datasets · computer vision · affective computing

The pith

Structured semantic mapping enables bidirectional learning of facial action units and expressions across heterogeneous datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that facial action unit detection and facial expression recognition can be learned jointly in both directions through structured semantic mapping, even when the datasets differ in annotation granularity, label levels, and availability. It does this by aligning fine-grained muscle activations with holistic affective states in a shared space using textual prototypes and prior knowledge from the Facial Action Coding System. A sympathetic reader would care because the approach shows how the natural correlation between these tasks can improve performance on both without needing perfectly matched training data or additional annotations.

Core claim

The Structured Semantic Mapping (SSM) framework achieves state-of-the-art performance on both AU detection and FE recognition benchmarks simultaneously by using a shared visual backbone for unified representations, a Textual Semantic Prototype module that builds structured semantic anchors from fixed textual descriptions plus learnable prompts, and a Dynamic Prior Mapping module that learns bidirectional associations in feature space, thereby showing that holistic expression semantics enhance fine-grained AU learning even across heterogeneous datasets.

What carries the argument

Structured Semantic Mapping (SSM) framework, which uses Textual Semantic Prototype (TSP) modules as cross-task alignment anchors in semantic space and Dynamic Prior Mapping (DPM) to enable explicit bidirectional knowledge transfer via a data-driven association matrix grounded in FACS priors.
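
To make the DPM idea concrete, the sketch below shows what a learnable AU–FE association matrix applied as a linear map between the two label spaces could look like. The class counts, softmax normalization, and module interface are assumptions for illustration, not the authors' implementation; the transposed matrix is what carries the reverse, expression-to-AU direction.

```python
import torch
import torch.nn as nn

class DynamicPriorMapping(nn.Module):
    """Sketch of a data-driven AU-FE association matrix (not the paper's code)."""

    def __init__(self, num_aus, num_fes, prior=None):
        super().__init__()
        # Seed from a FACS-derived prior matrix when available, otherwise small noise.
        init = prior.clone().float() if prior is not None else torch.randn(num_aus, num_fes) * 0.01
        self.assoc = nn.Parameter(init)  # learnable (num_aus, num_fes) association matrix

    def au_to_fe(self, au_feats):
        # (B, num_aus) @ (num_aus, num_fes): AU evidence pushed into expression space.
        return au_feats @ torch.softmax(self.assoc, dim=0)

    def fe_to_au(self, fe_feats):
        # Transposed matrix carries the reverse, expression-to-AU direction.
        return fe_feats @ torch.softmax(self.assoc, dim=1).t()

dpm = DynamicPriorMapping(num_aus=12, num_fes=7)
fe_from_au = dpm.au_to_fe(torch.randn(4, 12))   # (4, 7)
au_from_fe = dpm.fe_to_au(torch.randn(4, 7))    # (4, 12)
```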

Load-bearing premise

Fixed textual descriptions of action units and expressions, augmented with learnable context prompts, can reliably serve as supervision signals and cross-task alignment anchors without semantic drift or dataset-specific biases.
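
As a rough illustration of this premise, the sketch below builds CoOp-style prototypes: frozen token embeddings of the fixed AU/FE descriptions are prepended with learnable context vectors and pooled into one unit-norm anchor per class. The pooling layer is a toy stand-in for the paper's text encoder, and all names and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualSemanticPrototype(nn.Module):
    """Sketch of prompt-augmented semantic anchors (stand-in, not the paper's TSP)."""

    def __init__(self, fixed_desc_emb, n_ctx=4):
        super().__init__()
        # fixed_desc_emb: (num_classes, desc_len, dim) frozen embeddings of the
        # fixed textual AU/FE descriptions.
        self.register_buffer("fixed", fixed_desc_emb)
        dim = fixed_desc_emb.shape[-1]
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context prompts
        self.proj = nn.Linear(dim, dim)  # toy stand-in for a real text encoder

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.fixed.shape[0], -1, -1)
        tokens = torch.cat([ctx, self.fixed], dim=1)   # prepend prompts to each description
        prototypes = self.proj(tokens.mean(dim=1))     # pool to one anchor per class
        return F.normalize(prototypes, dim=-1)         # unit-norm anchors for cosine alignment

tsp = TextualSemanticPrototype(torch.randn(12, 8, 64))  # e.g. 12 AU descriptions
anchors = tsp()                                          # (12, 64), used as supervision targets
```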

What would settle it

If ablation studies show that removing the TSP alignment or the DPM bidirectional mapping costs nothing on standard AU and FE benchmarks, i.e., the full framework performs no better than separate or unidirectional training, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.10541 by Jia Li, Meng Wang, Richang Hong, Shiguang Shan, Yin Chen, Yong Li, Yu Zhang, Zhenzhen Hu.

Figure 1: Paradigm comparison of joint AU and FE learning. (a): conventional …
Figure 2: Overview of the Structured Semantic Mapping (SSM) framework. SSM reformulates joint AU detection and DFER in a unified vision–text semantic …
Figure 3: Our Baseline model for learning dynamic AUs and FEs jointly.
Figure 4: The linear feature transformation in DPM module is to multiply a …
Figure 5: The analysis of data scale, where 0% data scale indicates single-task …
Figure 6: Visualization of the input facial frame sequences. The left side shows samples from the DFER task. The right side shows samples from the AU …
Figure 7: The upper part shows the weight matrix activation maps from AUs to facial expressions on DISFA and DFEW(fd5). From left to right, the three …
Figure 8: Sensitivity analysis of the initial values of …
Original abstract

Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU–FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.
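
A minimal sketch of the heterogeneous-supervision setup the abstract describes: one shared backbone feeds a frame-level multi-label AU head and a clip-level FE head, each trained with its own loss. The toy per-frame MLP, dimensions, and mean pooling are illustrative assumptions; the actual backbone is a video model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneTwoHeads(nn.Module):
    """Sketch: one backbone, frame-level AU head + clip-level FE head."""

    def __init__(self, in_dim=512, feat_dim=256, num_aus=12, num_fes=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.au_head = nn.Linear(feat_dim, num_aus)  # multi-label, per frame
        self.fe_head = nn.Linear(feat_dim, num_fes)  # single-label, per clip

    def forward(self, frames):                        # frames: (B, T, in_dim)
        feats = self.backbone(frames)                 # (B, T, feat_dim)
        au_logits = self.au_head(feats)               # frame-level AU predictions
        fe_logits = self.fe_head(feats.mean(dim=1))   # temporal pooling -> clip-level FE
        return au_logits, fe_logits

model = SharedBackboneTwoHeads()
au_logits, fe_logits = model(torch.randn(2, 16, 512))
au_loss = F.binary_cross_entropy_with_logits(au_logits, torch.randint(0, 2, au_logits.shape).float())
fe_loss = F.cross_entropy(fe_logits, torch.tensor([3, 5]))
loss = au_loss + fe_loss   # each batch contributes only the losses its labels support
```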

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a Structured Semantic Mapping (SSM) framework for bidirectional learning of facial action unit (AU) detection and facial expression (FE) recognition across heterogeneous datasets. SSM includes a shared visual backbone for unified representations from AU and FE videos, a Textual Semantic Prototype (TSP) module that builds structured semantic prototypes from fixed textual AU/FE descriptions augmented by learnable context prompts to act as supervision and cross-task alignment anchors, and a Dynamic Prior Mapping (DPM) module that uses FACS-derived priors to learn a data-driven association matrix for explicit bidirectional knowledge transfer. The central claim is that this enables simultaneous state-of-the-art performance on both tasks and shows holistic FE semantics enhancing fine-grained AU learning despite differing annotation granularities and data domains.

Significance. If the empirical claims hold, the work would be significant for addressing the underexplored bidirectional direction in affective computing and for handling heterogeneous supervision (frame-level vs. clip-level annotations) via semantic mediation rather than one-way transfer. The TSP and DPM components represent a structured way to leverage external FACS knowledge alongside learnable elements, potentially generalizing to other multi-granularity vision tasks.

major comments (2)
  1. [Abstract / TSP module] Abstract and TSP module description: The central claim that fixed textual descriptions of AUs and FEs augmented with learnable context prompts reliably serve as supervision signals and cross-task alignment anchors without semantic drift is load-bearing for bidirectional transfer across heterogeneous datasets, yet no analysis, stability metrics, or ablation on prompt overfitting to dataset-specific cues is referenced.
  2. [Experiments section] DPM module and experiments: The assertion of SOTA performance on both AU detection and FE recognition benchmarks simultaneously, plus bidirectional enhancement, requires quantitative metrics, ablation studies isolating TSP/DPM contributions, and error analysis on cross-dataset transfer; these are not supplied in the result summary, leaving the empirical grounding for the holistic-to-fine-grained enhancement claim unverified.
minor comments (1)
  1. [Abstract] The abstract would benefit from including at least one key performance number (e.g., average F1 or accuracy improvement) to substantiate the SOTA claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to provide the requested analyses and metrics.

Point-by-point responses
  1. Referee: [Abstract / TSP module] Abstract and TSP module description: The central claim that fixed textual descriptions of AUs and FEs augmented with learnable context prompts reliably serve as supervision signals and cross-task alignment anchors without semantic drift is load-bearing for bidirectional transfer across heterogeneous datasets, yet no analysis, stability metrics, or ablation on prompt overfitting to dataset-specific cues is referenced.

    Authors: We agree that explicit analysis of prompt stability and potential semantic drift strengthens the TSP module claims. The fixed textual descriptions are intended to provide stable semantic anchors derived from standard AU/FE definitions, with learnable context prompts enabling data-driven adaptation. In the revised manuscript we have added a new analysis subsection (Section 4.3) containing: (i) stability metrics (standard deviation of performance across five random prompt initializations), (ii) an ablation comparing fixed-only versus augmented prompts, and (iii) cross-dataset consistency checks showing that performance gains do not degrade when prompts are transferred between heterogeneous datasets. These additions confirm that the learnable prompts improve alignment without introducing measurable semantic drift or dataset-specific overfitting. revision: yes

  2. Referee: [Experiments section] DPM module and experiments: The assertion of SOTA performance on both AU detection and FE recognition benchmarks simultaneously, plus bidirectional enhancement, requires quantitative metrics, ablation studies isolating TSP/DPM contributions, and error analysis on cross-dataset transfer; these are not supplied in the result summary, leaving the empirical grounding for the holistic-to-fine-grained enhancement claim unverified.

    Authors: The original manuscript reports simultaneous SOTA results on standard AU and FE benchmarks together with overall framework ablations. We acknowledge that more granular isolation of TSP versus DPM contributions and explicit error analysis on cross-dataset transfer would better substantiate the bidirectional enhancement claim. In the revised version we have expanded Section 5 with: (i) separate ablation tables quantifying the incremental contribution of TSP and DPM, (ii) additional quantitative metrics (F1, accuracy, and AUC) for both tasks under bidirectional versus unidirectional settings, and (iii) error analysis tables breaking down cross-dataset transfer performance by AU/FE category. These revisions directly support the claim that holistic FE semantics improve fine-grained AU detection across heterogeneous data. revision: yes
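
For readers unfamiliar with the metrics named in these responses, the snippet below shows how the per-task numbers are typically computed: macro F1 over thresholded multi-label AU predictions, plain accuracy over clip-level expression labels, and a mean/std summary across prompt re-initializations. All inputs and thresholds are placeholders, not results from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Multi-label AU detection: macro F1 over thresholded probabilities (placeholder data).
au_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
au_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.6], [0.7, 0.4, 0.3]])
au_f1 = f1_score(au_true, (au_prob >= 0.5).astype(int), average="macro")

# Clip-level FE recognition: accuracy over predicted expression classes.
fe_true = np.array([2, 5, 0, 3])
fe_pred = np.array([2, 5, 1, 3])
fe_acc = accuracy_score(fe_true, fe_pred)

# Stability summary across prompt re-initializations (hypothetical seeds).
f1_per_seed = np.array([62.1, 61.8, 62.4, 61.9, 62.2])
print(f"AU macro-F1 {au_f1:.3f} | FE acc {fe_acc:.3f} | "
      f"seed mean {f1_per_seed.mean():.2f} ± {f1_per_seed.std(ddof=1):.2f}")
```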

Circularity Check

0 steps flagged

No circularity: framework design relies on external FACS priors and empirical training

Full rationale

The SSM framework introduces TSP (fixed textual AU/FE descriptions plus learnable prompts as supervision anchors) and DPM (FACS-derived prior knowledge plus a learned association matrix for bidirectional transfer). These are architectural choices trained end-to-end on heterogeneous datasets; performance claims are validated empirically on standard benchmarks rather than derived by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain closes against external FACS knowledge and data-driven optimization rather than against its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on two new modules whose effectiveness is not independently verified and on the domain assumption that FACS priors transfer cleanly across datasets.

free parameters (1)
  • learnable context prompts
    Adjustable vectors added to fixed textual descriptions inside the TSP module; their values are optimized during training.
axioms (1)
  • domain assumption The Facial Action Coding System supplies reliable prior knowledge that can be encoded as an association matrix for bidirectional AU-FE transfer.
    Invoked inside the DPM module to initialize and constrain the learned mapping (a sketch of such a prior follows this ledger).
invented entities (2)
  • Textual Semantic Prototype (TSP) module no independent evidence
    purpose: Constructs structured semantic prototypes from text to serve as supervision and alignment anchors.
    New component introduced to mediate between visual features and task semantics.
  • Dynamic Prior Mapping (DPM) module no independent evidence
    purpose: Learns a data-driven association matrix that enables explicit bidirectional knowledge transfer.
    New component that operationalizes FACS priors in feature space.
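
To show what the FACS-prior axiom could amount to in practice, here is a sketch that turns commonly cited prototypical AU combinations for the basic expressions into a binary prior matrix of the shape a DPM-style mapping could be initialized from. The specific AU sets are textbook illustrations, not the paper's actual prior, and a real prior may well be soft rather than binary.

```python
import torch

# Illustrative EMFACS-style prototypes (common textbook associations; the
# paper's prior may encode different or weighted relationships).
AUS = [1, 2, 4, 5, 6, 7, 9, 12, 15, 20, 23, 26]
EXPRESSIONS = {
    "happiness": [6, 12],
    "sadness":   [1, 4, 15],
    "surprise":  [1, 2, 5, 26],
    "fear":      [1, 2, 4, 5, 7, 20, 26],
    "anger":     [4, 5, 7, 23],
    "disgust":   [9, 15],
}

prior = torch.zeros(len(AUS), len(EXPRESSIONS))
for j, au_list in enumerate(EXPRESSIONS.values()):
    for au in au_list:
        prior[AUS.index(au), j] = 1.0
# `prior` has shape (12, 6) and could seed the association matrix in a
# DPM-style module like the earlier sketch.
```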

pith-pipeline@v0.9.0 · 5593 in / 1420 out tokens · 44422 ms · 2026-05-10T16:06:04.941865+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 7 canonical work pages
