A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Pith reviewed 2026-05-15 11:32 UTC · model grok-4.3
The pith
A two-stage dual-modal model using DINOv2 visual features and Wav2Vec audio features reaches a Macro-F1 of 0.5368 on the ABAW validation set for facial expression recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that robust frame-level visual representations, obtained by averaging DINOv2 features from multi-scale re-crops and combined with frame-aligned Wav2Vec audio features through a lightweight gated fusion module followed by temporal smoothing, deliver a Macro-F1 of 0.5368 on the official validation set and 0.5122 ± 0.0277 under five-fold cross-validation, surpassing the provided baselines on the ABAW expression recognition task.
What carries the argument
The two-stage dual-modality pipeline that averages DINOv2 visual features across scales, fuses them with Wav2Vec audio features via a gated module, and applies inference-time temporal smoothing.
If this is right
- The averaging of multi-scale visual features reduces sensitivity to pose and scale variation within individual frames.
- Gated fusion allows the model to weigh acoustic cues when visual information is degraded by motion blur.
- Temporal smoothing at inference improves consistency across adjacent frames without retraining.
- The mixture-of-experts head in the visual stage increases classifier diversity for the eight expression classes.
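The gated fusion step named above can be illustrated with a minimal sketch. This is not the paper's exact module; the projection and gate shapes here are assumptions chosen for clarity, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v, a, W_g, b_g, W_a, b_a):
    """Fuse a visual feature v and an audio feature a with a per-dimension gate.

    The gate g = sigmoid(W_g @ [v; a_proj] + b_g) decides, dimension by
    dimension, how much of the (projected) audio stream to mix into the
    visual stream.
    """
    a_proj = W_a @ a + b_a                          # project audio into the visual space
    g = sigmoid(W_g @ np.concatenate([v, a_proj]) + b_g)
    return g * v + (1.0 - g) * a_proj               # convex per-dimension mixture

# Toy dimensions and random stand-in weights (hypothetical, not the paper's).
d_v, d_a = 8, 4
v = rng.normal(size=d_v)                            # frame-level visual feature
a = rng.normal(size=d_a)                            # frame-aligned audio feature
W_a = rng.normal(size=(d_v, d_a)) * 0.1
b_a = np.zeros(d_v)
W_g = rng.normal(size=(d_v, 2 * d_v)) * 0.1
b_g = np.zeros(d_v)

fused = gated_fusion(v, a, W_g, b_g, W_a, b_a)
```

Because the output is a convex combination per dimension, each fused coordinate lies between the corresponding visual and projected-audio coordinates, which is what lets the gate downweight a degraded modality.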
Where Pith is reading between the lines
- The same feature averaging and gated fusion steps could be tested on other in-the-wild video datasets to check whether the performance gain holds when audio alignment differs.
- If visual features alone already exceed the baseline, the contribution of the audio stream could be isolated by ablating the gated module on the same validation set.
- The approach might extend to continuous emotion regression tasks by replacing the classification head with a regression output while retaining the dual-modal fusion.
- Replacing the fixed temporal smoothing window with a learned recurrent layer could further reduce frame-to-frame label flips on longer video sequences.
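The fixed-window temporal smoothing discussed above can be sketched as a box filter over per-frame logits followed by an argmax. The paper does not specify its exact kernel, so the centered moving average below is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

def smooth_predictions(logits, window=5):
    """Average per-frame logits over a centered odd-length window, then argmax.

    logits: (T, C) array of per-frame class scores. Edge frames are handled
    by edge-padding, one common convention for short clips.
    """
    assert window % 2 == 1, "use an odd window so frames stay centered"
    T, C = logits.shape
    pad = window // 2
    padded = np.pad(logits, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(C)],
        axis=1,
    )  # back to shape (T, C)
    return smoothed.argmax(axis=1)

# A sequence that flips to a spurious class for a single frame:
logits = np.array([[2.0, 0.0]] * 4 + [[0.0, 2.1]] + [[2.0, 0.0]] * 4)
labels = smooth_predictions(logits, window=3)  # the one-frame flip is suppressed
```

This is exactly the frame-to-frame label-flip suppression the bullet about a learned recurrent layer proposes to improve on.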
Load-bearing premise
That the combination of pretrained DINOv2 and Wav2Vec features with averaging, gated fusion, and smoothing will continue to outperform baselines on the ABAW dataset without strong dependence on particular preprocessing steps or overfitting to its specific conditions.
What would settle it
Running the identical two-stage model on a fresh collection of unconstrained videos that share the same eight expression labels but differ in lighting, audio quality, or face localization statistics: a Macro-F1 that falls below the official baseline values would refute the claimed robustness, while one that holds above them would support it.
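Since Macro-F1 is the metric such a test would turn on, here is a minimal reference computation. The handling of classes absent from both labels and predictions (counted as F1 = 0) is one common convention; implementations differ on this edge case.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=8):
    """Macro-F1: per-class F1 averaged with equal weight per class.

    Unlike accuracy or micro-F1, rare expression classes count as much as
    frequent ones, which is why it is the headline metric here.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # absent class -> 0
    return float(np.mean(f1s))
```

On a toy two-class example, `macro_f1([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)` averages a class-0 F1 of 2/3 and a class-1 F1 of 4/5.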
Original abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
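The multi-scale averaging the abstract describes for Stage II can be sketched as follows. The `encode` callable is a placeholder for the frozen visual backbone, not the paper's actual DINOv2 interface, and the unit-normalization before averaging is an assumption, not a detail the abstract confirms.

```python
import numpy as np

def multiscale_frame_feature(frame_crops, encode):
    """Average encoder features over several re-crops of one frame.

    frame_crops: list of crop arrays of the same face at different scales.
    encode: maps one crop to a fixed-length feature vector (stand-in for
    the pretrained visual backbone).
    """
    feats = np.stack([encode(c) for c in frame_crops])        # (n_scales, d)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)     # unit-normalize (assumed)
    return feats.mean(axis=0)                                 # robust frame-level feature

# Toy usage: three scales of one frame, with a trivial stand-in encoder.
rng = np.random.default_rng(1)
crops = [rng.normal(size=(s, s, 3)) for s in (112, 168, 224)]
encode = lambda c: c.mean(axis=(0, 1))                        # toy encoder, output dim 3
f = multiscale_frame_feature(crops, encode)
```

Averaging over re-crops is what gives the claimed robustness to pose and scale variation: a feature that depends strongly on crop geometry is washed out, while content shared across scales survives.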
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage dual-modality model for frame-level facial expression recognition on the ABAW challenge dataset. Stage I extracts robust visual features using a pretrained DINOv2 ViT-L/14 backbone with padding-aware augmentation (PadAug) and a mixture-of-experts (MoE) head. Stage II performs multi-scale face averaging on visual features, extracts frame-aligned Wav2Vec 2.0 audio features, fuses them with a gated module, and applies inference-time temporal smoothing. The model reports a Macro-F1 of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.
Significance. If the results hold, the work provides a competitive empirical demonstration of combining strong pretrained visual and audio encoders with simple fusion and smoothing for in-the-wild expression recognition. The use of 5-fold cross-validation with error bars and explicit baseline comparisons is a positive aspect of the evaluation. However, the lack of ablations and external validation limits insight into whether the two-stage design meaningfully advances beyond the pretrained backbones alone.
Major comments (2)
- [Experiments] Experiments section: No ablation studies are reported that isolate the contribution of PadAug, the MoE head, multi-scale averaging, or the gated fusion module versus the DINOv2 and Wav2Vec backbones used in isolation. Without these controls, the headline Macro-F1 gains cannot be confidently attributed to the proposed architecture rather than the strength of the pretrained features.
- [Evaluation and results] Evaluation and results: The paper evaluates exclusively on the ABAW dataset with no tests on additional expression recognition benchmarks. This leaves open the possibility that performance is tied to ABAW-specific preprocessing, data characteristics, or the official validation split, weakening the claim that the method robustly addresses general challenges such as pose variation and motion blur.
Minor comments (1)
- [Abstract] Abstract: Training procedures, hyperparameter choices, and the exact composition of the official baselines are not summarized, making it harder for readers to assess reproducibility from the abstract alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make.
Point-by-point responses
-
Referee: [Experiments] Experiments section: No ablation studies are reported that isolate the contribution of PadAug, the MoE head, multi-scale averaging, or the gated fusion module versus the DINOv2 and Wav2Vec backbones used in isolation. Without these controls, the headline Macro-F1 gains cannot be confidently attributed to the proposed architecture rather than the strength of the pretrained features.
Authors: We agree that ablation studies are necessary to isolate the contributions of each component. In the revised manuscript, we will add a dedicated ablation section on the ABAW validation set. This will include performance comparisons for the full model versus variants without PadAug, without the MoE head, without multi-scale averaging, and without the gated fusion module, as well as direct comparisons to the DINOv2 and Wav2Vec backbones used in isolation. These results will clarify the source of the observed Macro-F1 improvements. revision: yes
-
Referee: [Evaluation and results] Evaluation and results: The paper evaluates exclusively on the ABAW dataset with no tests on additional expression recognition benchmarks. This leaves open the possibility that performance is tied to ABAW-specific preprocessing, data characteristics, or the official validation split, weakening the claim that the method robustly addresses general challenges such as pose variation and motion blur.
Authors: We acknowledge that evaluation on additional benchmarks would strengthen claims of general robustness. Our submission targets the ABAW challenge specifically, where the dataset characteristics (including pose variation and motion blur) are central. In the revision, we will expand the discussion to explicitly address this limitation, clarify the focus on ABAW, and note that the 5-fold cross-validation with error bars and baseline comparisons provide evidence within this domain. We will also outline plans for future cross-dataset evaluation. revision: partial
Circularity Check
No circularity: purely empirical model evaluation on a public dataset.
Full rationale
The paper describes a two-stage audio-visual pipeline using pretrained DINOv2 and Wav2Vec backbones, custom augmentations, gated fusion, and temporal smoothing, then reports Macro-F1 scores on the ABAW validation set and 5-fold CV. No equations, derivations, or fitted parameters are presented whose outputs are later relabeled as predictions. All performance numbers are direct measurements on held-out competition data rather than quantities forced by self-definition or self-citation chains. The central claim therefore rests on external experimental outcomes, not on any reduction to its own inputs.