pith. machine review for the scientific record.

arxiv: 2603.12221 · v2 · submitted 2026-03-12 · 💻 cs.CV

A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Pith reviewed 2026-05-15 11:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial expression recognition · dual-modality · DINOv2 · Wav2Vec · ABAW · gated fusion · temporal smoothing

The pith

A two-stage dual-modal model using DINOv2 visual features and Wav2Vec audio features reaches a Macro-F1 of 0.5368 on the ABAW validation set for facial expression recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage dual-modal system to classify eight facial emotional expressions frame by frame from unconstrained videos. Stage one extracts visual features with a pretrained DINOv2 encoder, applies padding-aware augmentation, and uses a mixture-of-experts head. Stage two averages multi-scale visual crops, aligns them with Wav2Vec 2.0 audio features, fuses the modalities through a gated module, and applies temporal smoothing at inference. This pipeline produces the reported scores and exceeds the official baselines. A reader would care because reliable expression recognition under real-world conditions such as blur, pose changes, and scale variation supports downstream uses in human-computer interaction and behavioral analysis.
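
As a concrete illustration of the multi-scale averaging step, the sketch below pools DINOv2 CLS embeddings over three re-crops of the same face. It is a minimal reconstruction from the description above, not the authors' code: the hub entry point is the public one from the DINOv2 repository, while the crop handling, resize target, and preprocessing are assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of Stage-II's multi-scale visual averaging: encode three
# re-crops of the same face with DINOv2 ViT-L/14 and average the embeddings.
# Preprocessing (resize target, normalization) is an illustrative assumption.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

@torch.no_grad()
def frame_representation(crops):
    """crops: list of (3, H, W) float tensors, one per re-crop scale."""
    feats = []
    for crop in crops:
        # ViT-L/14 needs spatial sizes divisible by the 14-pixel patch size.
        x = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                          mode="bilinear", align_corners=False)
        feats.append(encoder(x))              # (1, 1024) CLS embedding
    return torch.stack(feats).mean(dim=0)     # scale-averaged frame feature
```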

Core claim

The authors claim that robust frame-level visual representations obtained by averaging DINOv2 features from multi-scale re-crops, when combined with frame-aligned Wav2Vec audio features through a lightweight gated fusion module and followed by temporal smoothing, deliver a Macro-F1 score of 0.5368 on the official validation set and 0.5122 ± 0.0277 under five-fold cross-validation, surpassing the provided baselines on the ABAW expression recognition task.
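
Both headline numbers are standard quantities. The sketch below shows how a reader could recompute them given per-frame predictions and labels for each fold; the arrays here are random stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Macro-F1 over the 8 expression classes, then mean +/- std across 5 folds.
# Random stand-in data; replace with real per-frame labels and predictions.
rng = np.random.default_rng(0)
fold_labels = [rng.integers(0, 8, size=1000) for _ in range(5)]
fold_preds = [rng.integers(0, 8, size=1000) for _ in range(5)]

scores = [
    f1_score(y, p, average="macro", labels=list(range(8)))
    for y, p in zip(fold_labels, fold_preds)
]
print(f"Macro-F1: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```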

What carries the argument

The two-stage dual-modality pipeline that averages DINOv2 visual features across scales, fuses them with Wav2Vec audio features via a gated module, and applies inference-time temporal smoothing.
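
The gate is the one load-bearing piece of new machinery, so a sketch helps fix ideas. The module below follows the gated multimodal unit the paper cites (Arevalo et al. [16]); the feature dimensions (1024 for DINOv2 ViT-L, 768 for a Wav2Vec 2.0 base model), hidden size, and the single shared gate are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

# Hedged sketch of a lightweight gated fusion module in the spirit of the
# gated multimodal unit: a sigmoid gate decides, per hidden dimension, how
# much to trust the visual versus the acoustic pathway for each frame.
class GatedFusion(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=768, hidden=512, n_classes=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        self.gate = nn.Linear(vis_dim + aud_dim, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, vis, aud):
        h_v = torch.tanh(self.vis_proj(vis))               # visual pathway
        h_a = torch.tanh(self.aud_proj(aud))               # acoustic pathway
        z = torch.sigmoid(self.gate(torch.cat([vis, aud], dim=-1)))
        return self.head(z * h_v + (1 - z) * h_a)          # 8-way logits

logits = GatedFusion()(torch.randn(4, 1024), torch.randn(4, 768))  # (4, 8)
```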

If this is right

  • The averaging of multi-scale visual features reduces sensitivity to pose and scale variation within individual frames.
  • Gated fusion allows the model to weigh acoustic cues when visual information is degraded by motion blur.
  • Temporal smoothing at inference improves consistency across adjacent frames without retraining (a minimal sketch follows this list).
  • The mixture-of-experts head in the visual stage increases classifier diversity for the eight expression classes.
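
A minimal version of the smoothing step, under the assumption that it is a centered moving average over per-frame class probabilities (the paper sweeps the setting in its Figure 4; the window length here is illustrative only):

```python
import numpy as np

# Centered moving average over per-frame class probabilities, then argmax.
# The window length is an illustrative choice, not the paper's tuned value.
def smooth_labels(probs, window=5):
    """probs: (T, 8) array of per-frame class probabilities for one video."""
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(probs, ((pad, pad), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid")
         for c in range(probs.shape[1])], axis=1)
    return smoothed.argmax(axis=1)  # one expression label per frame
```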

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same feature averaging and gated fusion steps could be tested on other in-the-wild video datasets to check whether the performance gain holds when audio alignment differs.
  • If visual features alone already exceed the baseline, the contribution of the audio stream could be isolated by ablating the gated module on the same validation set.
  • The approach might extend to continuous emotion regression tasks by replacing the classification head with a regression output while retaining the dual-modal fusion.
  • Replacing the fixed temporal smoothing window with a learned recurrent layer could further reduce frame-to-frame label flips on longer video sequences.

Load-bearing premise

That the combination of pretrained DINOv2 and Wav2Vec features with averaging, gated fusion, and smoothing will continue to outperform baselines on the ABAW dataset without strong dependence on particular preprocessing steps or overfitting to its specific conditions.

What would settle it

Running the identical two-stage model on a fresh collection of unconstrained videos that share the same eight expression labels but differ in lighting, audio quality, or face localization statistics, and checking whether the Macro-F1 stays above the official baseline values; a drop below baseline would undercut the robustness claim.

Figures

Figures reproduced from arXiv: 2603.12221 by Jiajun Sun, Zhe Gao.

Figure 1. Overview of the proposed model. For the video modality, three facial crops at different scales for each target frame are extracted. [figure: figures/full_fig_p004_1.png]
Figure 2. The visual adaptation pipeline in Stage-I. The DINOv2-… [figure: figures/full_fig_p004_2.png]
Figure 3. Illustration of the proposed padding-aware augmentation. [figure: figures/full_fig_p004_3.png]
Figure 4. Macro-averaged F1 score under different temporal… [figure: figures/full_fig_p008_4.png]
Original abstract

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage dual-modality model for frame-level facial expression recognition on the ABAW challenge dataset. Stage I extracts robust visual features using a pretrained DINOv2 ViT-L/14 backbone with padding-aware augmentation (PadAug) and a mixture-of-experts (MoE) head. Stage II performs multi-scale face averaging on visual features, extracts frame-aligned Wav2Vec 2.0 audio features, fuses them with a gated module, and applies inference-time temporal smoothing. The model reports a Macro-F1 of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.

Significance. If the results hold, the work provides a competitive empirical demonstration of combining strong pretrained visual and audio encoders with simple fusion and smoothing for in-the-wild expression recognition. The use of 5-fold cross-validation with error bars and explicit baseline comparisons is a positive aspect of the evaluation. However, the lack of ablations and external validation limits insight into whether the two-stage design meaningfully advances beyond the pretrained backbones alone.

major comments (2)
  1. [Experiments] Experiments section: No ablation studies are reported that isolate the contribution of PadAug, the MoE head, multi-scale averaging, or the gated fusion module versus the DINOv2 and Wav2Vec backbones used in isolation. Without these controls, the headline Macro-F1 gains cannot be confidently attributed to the proposed architecture rather than the strength of the pretrained features.
  2. [Evaluation and results] Evaluation and results: The paper evaluates exclusively on the ABAW dataset with no tests on additional expression recognition benchmarks. This leaves open the possibility that performance is tied to ABAW-specific preprocessing, data characteristics, or the official validation split, weakening the claim that the method robustly addresses general challenges such as pose variation and motion blur.
minor comments (1)
  1. [Abstract] Abstract: Training procedures, hyperparameter choices, and the exact composition of the official baselines are not summarized, making it harder for readers to assess reproducibility from the abstract alone.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: No ablation studies are reported that isolate the contribution of PadAug, the MoE head, multi-scale averaging, or the gated fusion module versus the DINOv2 and Wav2Vec backbones used in isolation. Without these controls, the headline Macro-F1 gains cannot be confidently attributed to the proposed architecture rather than the strength of the pretrained features.

    Authors: We agree that ablation studies are necessary to isolate the contributions of each component. In the revised manuscript, we will add a dedicated ablation section on the ABAW validation set. This will include performance comparisons for the full model versus variants without PadAug, without the MoE head, without multi-scale averaging, and without the gated fusion module, as well as direct comparisons to the DINOv2 and Wav2Vec backbones used in isolation. These results will clarify the source of the observed Macro-F1 improvements. revision: yes

  2. Referee: [Evaluation and results] Evaluation and results: The paper evaluates exclusively on the ABAW dataset with no tests on additional expression recognition benchmarks. This leaves open the possibility that performance is tied to ABAW-specific preprocessing, data characteristics, or the official validation split, weakening the claim that the method robustly addresses general challenges such as pose variation and motion blur.

    Authors: We acknowledge that evaluation on additional benchmarks would strengthen claims of general robustness. Our submission targets the ABAW challenge specifically, where the dataset characteristics (including pose variation and motion blur) are central. In the revision, we will expand the discussion to explicitly address this limitation, clarify the focus on ABAW, and note that the 5-fold cross-validation with error bars and baseline comparisons provide evidence within this domain. We will also outline plans for future cross-dataset evaluation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical model evaluation on a public dataset

Full rationale

The paper describes a two-stage audio-visual pipeline using pretrained DINOv2 and Wav2Vec backbones, custom augmentations, gated fusion, and temporal smoothing, then reports Macro-F1 scores on the ABAW validation set and 5-fold CV. No equations, derivations, or fitted parameters are presented whose outputs are later relabeled as predictions. All performance numbers are direct measurements on held-out competition data rather than quantities forced by self-definition or self-citation chains. The central claim therefore rests on external experimental outcomes, not on any reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The model relies on standard pretrained encoders and common fusion techniques without introducing new free parameters, axioms, or invented entities beyond those in the base models.

pith-pipeline@v0.9.0 · 5583 in / 1035 out tokens · 31436 ms · 2026-05-15T11:32:27.914434+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    A review of affective computing: From unimodal analysis to multimodal fusion,

    S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017.

  2. [2]

    Deep facial expression recognition: A survey,

    S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Transactions on Affective Computing, vol. 13, pp. 1195–1215, July 2022.

  3. [3]

    Advances in facial expression recognition: A survey of methods, benchmarks, models, and datasets,

    T. Kopalidis, V. Solachidis, N. Vretos, and P. Daras, “Advances in facial expression recognition: A survey of methods, benchmarks, models, and datasets,” Information, vol. 15, no. 3, p. 135, 2024.

  4. [4]

    Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace,

    D. Kollias and S. Zafeiriou, “Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace,” arXiv preprint arXiv:1910.04855, 2019.

  5. [5]

    Advancements in affective and behavior analysis: The 8th ABAW workshop and competition,

    D. Kollias, P. Tzirakis, A. Cowen, S. Zafeiriou, I. Kotsia, E. Granger, M. Pedersoli, S. Bacon, A. Baird, C. Gagne, et al., “Advancements in affective and behavior analysis: The 8th ABAW workshop and competition,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5572–5583, 2025.

  6. [6]

    From emotions to violence: Multimodal fine-grained behavior analysis at the 9th ABAW,

    D. Kollias, S. Zafeiriou, I. Kotsia, G. Slabaugh, D. C. Senadeera, J. Zheng, K. K. K. Yadav, C. Shao, and G. Hu, “From emotions to violence: Multimodal fine-grained behavior analysis at the 9th ABAW,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–12, 2025.

  7. [7]

    10th workshop and competition on affective & behavior analysis in-the-Wild (ABAW)

    D. Kollias, S. Zafeiriou, I. Kotsia, P. Tzirakis, A. Cowen, E. Granger, M. Pedersoli, and S. Bacon, “10th workshop and competition on affective & behavior analysis in-the-Wild (ABAW).” Official workshop and competition website, in conjunction with IEEE/CVF CVPR 2026, 2026. Accessed: 2026-03-10.

  8. [8]

    ABAW: Facial expression recognition in the wild,

    D. Gera, B. N. S. Kumar, B. V. Raj Kumar, and S. Balasubramanian, “ABAW: Facial expression recognition in the wild,” arXiv preprint arXiv:2303.09785, 2023.

  9. [9]

    Coarse-to-fine cascaded networks with smooth predicting for video facial expression recognition,

    F. Xue, Z. Tan, Y. Zhu, Z. Ma, and G. Guo, “Coarse-to-fine cascaded networks with smooth predicting for video facial expression recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2412–2418, June 2022.

  10. [10]

    EmotiEffNet and temporal convolutional networks in video-based facial expression recognition and action unit detection,

    A. V. Savchenko and A. P. Sidorova, “EmotiEffNet and temporal convolutional networks in video-based facial expression recognition and action unit detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4849–4859, 2024.

  11. [11]

    Exploring facial expression recognition through semi-supervised pre-training and temporal modeling,

    J. Yu, Z. Wei, Z. Cai, G. Zhao, Z. Zhang, Y. Wang, G. Xie, J. Zhu, W. Zhu, Q. Liu, and J. Liang, “Exploring facial expression recognition through semi-supervised pre-training and temporal modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4880–4887, 2024.

  12. [12]

    Transformer-based multimodal information fusion for facial expression analysis,

    W. Zhang, F. Qiu, S. Wang, H. Zeng, Z. Zhang, R. An, B. Ma, and Y. Ding, “Transformer-based multimodal information fusion for facial expression analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 2428–2437, 2022.

  13. [13]

    Sun team’s contribution to ABAW 2024 competition: Audio-visual valence-arousal estimation and expression recognition,

    D. Dresvyanskiy, M. Markitantov, J. Yu, P. Li, H. Kaya, and A. Karpov, “Sun team’s contribution to ABAW 2024 competition: Audio-visual valence-arousal estimation and expression recognition,” arXiv preprint arXiv:2403.12609, 2024.

  14. [14]

    Advanced facial analysis in multi-modal data with cascaded cross-attention based transformer,

    J.-H. Kim, N. Kim, M. Hong, and C. S. Won, “Advanced facial analysis in multi-modal data with cascaded cross-attention based transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 7870–7877, 2024.

  15. [15]

    Leveraging lightweight facial models and textual modality in audio-visual emotional understanding in-the-wild,

    A. Savchenko and L. Savchenko, “Leveraging lightweight facial models and textual modality in audio-visual emotional understanding in-the-wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 5824–5834, June 2025.

  16. [16]

    Gated Multimodal Units for Information Fusion

    J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González, “Gated multimodal units for information fusion,” arXiv preprint arXiv:1702.01992, 2017.

  17. [17]

    Multimodal transformer for unaligned multimodal language sequences,

    Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6558–6569, 2019.

  18. [18]

    M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues,

    T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, “M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1359–1367, 2020.

  19. [19]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” …

  20. [20]

    Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition,

    S. Li and W. Deng, “Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 356–370, 2019.

  21. [21]

    AffectNet: A database for facial expression, valence, and arousal computing in the wild,

    A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.

  22. [22]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, 2020.

  23. [23]

    Face behavior a la carte: Expressions, affect and action units in a single network,

    D. Kollias, V. Sharmanska, and S. Zafeiriou, “Face behavior a la carte: Expressions, affect and action units in a single network,” arXiv preprint arXiv:1910.11111, 2019.

  24. [24]

    Distribution matching for heterogeneous multi-task learning: a large-scale face study,

    D. Kollias, V. Sharmanska, and S. Zafeiriou, “Distribution matching for heterogeneous multi-task learning: a large-scale face study,” arXiv preprint arXiv:2105.03790, 2021.

  25. [25]

    Multi-label compound expression recognition: C-EXPR database & network,

    D. Kollias, “Multi-label compound expression recognition: C-EXPR database & network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5598, 2023.

  26. [26]

    The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,

    P. Lucey, J. F. Cohn, T. Kanade, J. M. Saragih, Z. Ambadar, and I. A. Matthews, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 94–101, 2010.

  27. [27]

    Training deep networks for facial expression recognition with crowd-sourced label distribution,

    E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in ACM International Conference on Multimodal Interaction, 2016.

  28. [28]

    AffectNet+: A database for enhancing facial expression recognition with soft-labels,

    A. P. Fard, M. M. Hosseini, T. D. Sweeny, and M. H. Mahoor, “AffectNet+: A database for enhancing facial expression recognition with soft-labels,” IEEE Transactions on Affective Computing, pp. 1–16, 2025.

  29. [29]

    Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices,

    A. V. Savchenko, “Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices,” arXiv preprint arXiv:2203.13436, 2022.

  30. [30]

    Exploring expression-related self-supervised learning and spatial reserve pooling for affective behaviour analysis,

    F. Xue, Y. Sun, and Y. Yang, “Exploring expression-related self-supervised learning and spatial reserve pooling for affective behaviour analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 5701–5708, 2023.

  31. [31]

    Robust lightweight facial affective behavior recognition with CLIP,

    L. Lin, S. Papabathini, X. Wang, and S. Hu, “Robust lightweight facial affective behavior recognition with CLIP,” arXiv preprint arXiv:2403.09915, 2024.

  32. [32]

    A unified approach to facial affect analysis: The mae-face visual representation,

    B. Ma, W. Zhang, F. Qiu, and Y. Ding, “A unified approach to facial affect analysis: The mae-face visual representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 5924–5933.

  33. [33]

    Facial expression recognition based on multi-modal features for videos in the wild,

    C. Liu, X. Zhang, X. Liu, T. Zhang, L. Meng, Y. Liu, Y. Deng, and W. Jiang, “Facial expression recognition based on multi-modal features for videos in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 5872–5879, 2023.

  34. [34]

    Affective behaviour analysis via integrating multi-modal knowledge,

    X. Wang et al., “Affective behaviour analysis via integrating multi-modal knowledge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024.

  35. [35]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

  36. [36]

    Efficientnet: Rethinking model scaling for convolutional neural networks,

    M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning, pp. 6105–6114, 2019.

  37. [37]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.

  38. [38]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

  39. [39]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” International Conference on Machine Learning, 2021.

  40. [40]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.

  41. [41]

    Affective behavior analysis using task-adaptive and au-assisted graph network,

    X. Li, W. Du, and H. Yang, “Affective behavior analysis using task-adaptive and au-assisted graph network,” CoRR, vol. abs/2407.11663, 2024.

  42. [42]

    Facial expression recognition with hybrid features leveraging dino prior knowledge,

    Y. Xie, C. Ju, and Y. Chang, “Facial expression recognition with hybrid features leveraging dino prior knowledge,” Frontiers in Computing and Intelligent Systems, vol. 14, no. 3, pp. 82–88, 2025.

  43. [43]

    Hybrid feature facial expression recognition model based on dino prior,

    H. Wang, Y. Deng, T. Liu, and Z. Yang, “Hybrid feature facial expression recognition model based on dino prior,” Computer Engineering, vol. 51, no. 10, pp. 284–294, 2025.

  44. [44]

    Facial expression recognition based on multi-head cross attention network,

    Y. Zhang et al., “Facial expression recognition based on multi-head cross attention network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022.

  45. [45]

    Affect analysis in-the-Wild: Valence-arousal, expressions, action units and a unified framework,

    D. Kollias and S. Zafeiriou, “Affect analysis in-the-Wild: Valence-arousal, expressions, action units and a unified framework,” arXiv preprint arXiv:2103.15792, 2021.

  46. [46]

    Distribution matching for multi-task learning of classification tasks: a large-scale study on faces & beyond,

    D. Kollias, V. Sharmanska, and S. Zafeiriou, “Distribution matching for multi-task learning of classification tasks: a large-scale study on faces & beyond,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2813–2821.

  47. [47]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” arXiv preprint arXiv:2106.07447.

  48. [48]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, pp. 4171–4186, 2019.

  49. [49]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

  50. [50]

    Tensor fusion network for multimodal sentiment analysis,

    A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Conference on Empirical Methods in Natural Language Processing, 2017.

  51. [51]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.

  52. [52]

    Enhancing facial expression recognition with LSTM through dual-direction attention mixed feature networks and CLIP,

    J. Cabacas-Maso, E. Ortega-Beltrán, I. Benito-Altamirano, and C. Ventura, “Enhancing facial expression recognition with LSTM through dual-direction attention mixed feature networks and CLIP,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 5711–5717, 2025.

  53. [53]

    Temporal convolutional networks for action segmentation and detection,

    C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1003–1012, 2017.

  54. [54]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017.

  55. [55]

    Former-DFER: Dynamic facial expression recognition transformer,

    Z. Zhao and Q. Liu, “Former-DFER: Dynamic facial expression recognition transformer,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 1553–1561, 2021.

  56. [56]

    LOGO-Former: Local-global spatio-temporal transformer for dynamic facial expression recognition,

    F. Ma, B. Sun, and S. Li, “LOGO-Former: Local-global spatio-temporal transformer for dynamic facial expression recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, IEEE, 2023.

  57. [57]

    Learning on the edge: Investigating boundary filters in CNNs,

    C. Innamorati, T. Ritschel, T. Weyrich, and N. J. Mitra, “Learning on the edge: Investigating boundary filters in CNNs,” International Journal of Computer Vision, vol. 128, pp. 773–782, 2020.

  58. [58]

    InsightFace: 2D and 3D face analysis project

    J. Guo and J. Deng, “InsightFace: 2D and 3D face analysis project.” https://github.com/deepinsight/insightface, 2025. Accessed: 2026-03-15.

  59. [59]

    RetinaFace: Single-shot multi-level face localisation in the wild,

    J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “RetinaFace: Single-shot multi-level face localisation in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5203–5212, 2020.