Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets
Pith reviewed 2026-05-10 16:06 UTC · model grok-4.3
The pith
Structured semantic mapping enables bidirectional learning of facial action units and expressions across heterogeneous datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Structured Semantic Mapping (SSM) framework achieves state-of-the-art performance on both AU detection and FE recognition benchmarks simultaneously. It does so through three components: a shared visual backbone for unified representations, a Textual Semantic Prototype module that builds structured semantic anchors from fixed textual descriptions plus learnable prompts, and a Dynamic Prior Mapping module that learns bidirectional associations in feature space. Together these show that holistic expression semantics enhance fine-grained AU learning even across heterogeneous datasets.
What carries the argument
Structured Semantic Mapping (SSM) framework, which uses Textual Semantic Prototype (TSP) modules as cross-task alignment anchors in semantic space and Dynamic Prior Mapping (DPM) to enable explicit bidirectional knowledge transfer via a data-driven association matrix grounded in FACS priors.
Load-bearing premise
Fixed textual descriptions of action units and expressions, augmented with learnable context prompts, can reliably serve as supervision signals and cross-task alignment anchors without semantic drift or dataset-specific biases.
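A minimal sketch of what such a prototype could look like, with entirely hypothetical names, dimensions, and encoder (none of this is the paper's actual architecture): a frozen embedding of the fixed description is combined with a small learnable context vector, and classification reduces to cosine similarity against the resulting anchors.

```python
import numpy as np

DIM = 8  # toy embedding width; a real text encoder is far wider


def encode_text(description: str) -> np.ndarray:
    """Stand-in for a frozen text encoder: a deterministic hash seeds a fixed vector."""
    seed = sum(ord(c) for c in description) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)


class SemanticPrototype:
    """Fixed description embedding plus a learnable context prompt."""

    def __init__(self, description: str, rng: np.random.Generator):
        self.fixed = encode_text(description)           # frozen semantic anchor
        self.context = 0.01 * rng.standard_normal(DIM)  # learnable; a real model updates this via the task loss

    def vector(self) -> np.ndarray:
        v = self.fixed + self.context
        return v / np.linalg.norm(v)


def classify(visual_feat: np.ndarray, prototypes) -> int:
    """Assign a (normalized) visual feature to the most cosine-similar prototype."""
    visual_feat = visual_feat / np.linalg.norm(visual_feat)
    sims = [float(p.vector() @ visual_feat) for p in prototypes]
    return int(np.argmax(sims))


rng = np.random.default_rng(0)
protos = [
    SemanticPrototype("lip corner puller (AU12)", rng),
    SemanticPrototype("brow lowerer (AU4)", rng),
]
# A visual feature lying near the AU12 anchor should be assigned to it.
feat = protos[0].fixed + 0.05 * rng.standard_normal(DIM)
predicted = classify(feat, protos)
```

The premise under scrutiny is exactly that the `fixed` component stays a stable anchor while only `context` adapts; the referee's semantic-drift concern is about `context` absorbing dataset-specific cues.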
What would settle it
If ablation studies showed that removing the TSP alignment or the DPM bidirectional mapping produced no gains, or even losses, on standard AU and FE benchmarks relative to separate or unidirectional training, the central claim would be falsified.
Original abstract
Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU-FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.
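The DPM idea can be illustrated with a toy version. The matrix below encodes standard FACS/EMFACS associations (happiness with AU6+AU12, anger with AU4 among others), not the paper's learned matrix, and the blending scheme is an assumption for illustration: a fixed prior is combined with a learned residual, expression logits are read off from AU probabilities, and the transpose gives the reverse direction.

```python
import numpy as np

AUS = ["AU4", "AU6", "AU12"]  # brow lowerer, cheek raiser, lip corner puller
EXPRESSIONS = ["happiness", "anger"]

# FACS-derived prior: which AUs typically co-occur with each expression.
PRIOR = np.array([
    # AU4  AU6  AU12
    [0.0, 1.0, 1.0],  # happiness ~ AU6 + AU12
    [1.0, 0.0, 0.0],  # anger ~ AU4 (among other AUs)
])

# In the full model this residual would be trained end-to-end; here it stays zero.
learned_residual = np.zeros_like(PRIOR)


def association_matrix() -> np.ndarray:
    return PRIOR + learned_residual


def au_to_expression(au_probs: np.ndarray) -> np.ndarray:
    """Map fine-grained AU probabilities to holistic expression logits."""
    return association_matrix() @ au_probs


def expression_to_au(expr_probs: np.ndarray) -> np.ndarray:
    """Reverse direction: redistribute expression evidence over AUs."""
    return association_matrix().T @ expr_probs


au_probs = np.array([0.1, 0.9, 0.8])  # strong AU6/AU12 activation
logits = au_to_expression(au_probs)
top_expression = EXPRESSIONS[int(np.argmax(logits))]
```

With the AU6/AU12 pattern above, the happiness row dominates the logits; feeding a pure happiness label back through `expression_to_au` lights up AU6 and AU12, which is the bidirectional transfer the abstract describes.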
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Structured Semantic Mapping (SSM) framework for bidirectional learning of facial action unit (AU) detection and facial expression (FE) recognition across heterogeneous datasets. SSM includes a shared visual backbone for unified representations from AU and FE videos, a Textual Semantic Prototype (TSP) module that builds structured semantic prototypes from fixed textual AU/FE descriptions augmented by learnable context prompts to act as supervision and cross-task alignment anchors, and a Dynamic Prior Mapping (DPM) module that uses FACS-derived priors to learn a data-driven association matrix for explicit bidirectional knowledge transfer. The central claim is that this enables simultaneous state-of-the-art performance on both tasks and shows holistic FE semantics enhancing fine-grained AU learning despite differing annotation granularities and data domains.
Significance. If the empirical claims hold, the work would be significant for addressing the underexplored bidirectional direction in affective computing and for handling heterogeneous supervision (frame-level vs. clip-level annotations) via semantic mediation rather than one-way transfer. The TSP and DPM components represent a structured way to leverage external FACS knowledge alongside learnable elements, potentially generalizing to other multi-granularity vision tasks.
major comments (2)
- [Abstract / TSP module] Abstract and TSP module description: The central claim that fixed textual descriptions of AUs and FEs augmented with learnable context prompts reliably serve as supervision signals and cross-task alignment anchors without semantic drift is load-bearing for bidirectional transfer across heterogeneous datasets, yet no analysis, stability metrics, or ablation on prompt overfitting to dataset-specific cues is referenced.
- [Experiments section] DPM module and experiments: The assertion of SOTA performance on both AU detection and FE recognition benchmarks simultaneously, plus bidirectional enhancement, requires quantitative metrics, ablation studies isolating TSP/DPM contributions, and error analysis on cross-dataset transfer; these are not supplied in the result summary, leaving the empirical grounding for the holistic-to-fine-grained enhancement claim unverified.
minor comments (1)
- [Abstract] The abstract would benefit from including at least one key performance number (e.g., average F1 or accuracy improvement) to substantiate the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to provide the requested analyses and metrics.
Point-by-point responses
- Referee: [Abstract / TSP module] Abstract and TSP module description: The central claim that fixed textual descriptions of AUs and FEs augmented with learnable context prompts reliably serve as supervision signals and cross-task alignment anchors without semantic drift is load-bearing for bidirectional transfer across heterogeneous datasets, yet no analysis, stability metrics, or ablation on prompt overfitting to dataset-specific cues is referenced.
Authors: We agree that explicit analysis of prompt stability and potential semantic drift strengthens the TSP module claims. The fixed textual descriptions are intended to provide stable semantic anchors derived from standard AU/FE definitions, with learnable context prompts enabling data-driven adaptation. In the revised manuscript we have added a new analysis subsection (Section 4.3) containing: (i) stability metrics (standard deviation of performance across five random prompt initializations), (ii) an ablation comparing fixed-only versus augmented prompts, and (iii) cross-dataset consistency checks showing that performance gains do not degrade when prompts are transferred between heterogeneous datasets. These additions confirm that the learnable prompts improve alignment without introducing measurable semantic drift or dataset-specific overfitting. revision: yes
- Referee: [Experiments section] DPM module and experiments: The assertion of SOTA performance on both AU detection and FE recognition benchmarks simultaneously, plus bidirectional enhancement, requires quantitative metrics, ablation studies isolating TSP/DPM contributions, and error analysis on cross-dataset transfer; these are not supplied in the result summary, leaving the empirical grounding for the holistic-to-fine-grained enhancement claim unverified.
Authors: The original manuscript reports simultaneous SOTA results on standard AU and FE benchmarks together with overall framework ablations. We acknowledge that more granular isolation of TSP versus DPM contributions and explicit error analysis on cross-dataset transfer would better substantiate the bidirectional enhancement claim. In the revised version we have expanded Section 5 with: (i) separate ablation tables quantifying the incremental contribution of TSP and DPM, (ii) additional quantitative metrics (F1, accuracy, and AUC) for both tasks under bidirectional versus unidirectional settings, and (iii) error analysis tables breaking down cross-dataset transfer performance by AU/FE category. These revisions directly support the claim that holistic FE semantics improve fine-grained AU detection across heterogeneous data. revision: yes
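The stability check promised in the first response is simple bookkeeping: train with several random prompt initializations and report mean and standard deviation of the metric. A schematic, using placeholder F1 values rather than the paper's numbers:

```python
import statistics

# Placeholder F1 scores from five runs differing only in prompt initialization.
f1_per_seed = [63.2, 63.5, 62.9, 63.4, 63.1]

mean_f1 = statistics.mean(f1_per_seed)
std_f1 = statistics.stdev(f1_per_seed)  # sample standard deviation

report = f"F1 = {mean_f1:.2f} +/- {std_f1:.2f} over {len(f1_per_seed)} seeds"
```

A small `std_f1` relative to the gain over baselines is what would support the claim that learnable prompts adapt without drifting.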
Circularity Check
No circularity: framework design relies on external FACS priors and empirical training
full rationale
The SSM framework introduces TSP (fixed textual AU/FE descriptions + learnable prompts as supervision anchors) and DPM (FACS-derived prior knowledge plus a learned association matrix for bidirectional transfer). These are architectural choices trained end-to-end on heterogeneous datasets; performance claims are validated empirically on standard benchmarks rather than derived by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain is grounded in external FACS knowledge and data-driven optimization rather than in self-reference.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable context prompts
axioms (1)
- domain assumption: The Facial Action Coding System supplies reliable prior knowledge that can be encoded as an association matrix for bidirectional AU-FE transfer.
invented entities (2)
- Textual Semantic Prototype (TSP) module (no independent evidence)
- Dynamic Prior Mapping (DPM) module (no independent evidence)