Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

Dinithi Dissanayake; Ovindu Atukorala; Prasanth Sasikumar; Shaveen Silva; Suranga Nanayakkara

arxiv: 2605.21869 · v1 · pith:RPX5JXSVnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.HC

Pith reviewed 2026-05-22 07:55 UTC · model grok-4.3

keywords: emotion intensity prediction · multimodal fusion · modality dropout · Pearson correlation · video emotion analysis · challenge baseline · two-stage training

The pith

A two-stage process trains separate encoders for text, audio and vision then fuses them to predict six emotion intensity scores from video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that first training modality-specific encoders on their own data and then combining their outputs in a lightweight regressor with modality dropout produces usable predictions for continuous emotion mimicry intensities. A sympathetic reader would care because reliable intensity scores from ordinary video could support applications such as responsive interfaces or affective computing without requiring fully joint end-to-end training from scratch. The authors report that their text-audio-vision-motion system reaches an average Pearson correlation of 0.4722 on an expanded validation split and 0.57 on the hidden test set, finishing third in the challenge. They observe that adding a motion branch produces only marginal improvement yet remains worth examining.

Core claim

We present a staged multimodal framework that first trains modality-specific encoders independently for text, audio and vision, then fuses the resulting representations through a lightweight regressor that applies modality dropout and controlled encoder adaptation; an optional motion branch can be included. Under an expanded 4:1 data split the complete four-modality version attains an average Pearson correlation of 0.4722 on validation, while the same system records 0.57 on the official test set and secures third place in the EMI challenge.

What carries the argument

A two-stage multimodal framework that trains modality-specific encoders independently, then fuses them with a lightweight regressor using modality dropout and controlled encoder adaptation.
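The modality dropout step in the fusion stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `modality_dropout` and `fuse`, the drop probability `p_drop`, the rule of always keeping at least one modality, and the use of zero vectors for dropped modalities are all hypothetical choices that the source does not specify.

```python
import random

def modality_dropout(features, p_drop=0.3, rng=random, training=True):
    """Zero out whole modality vectors at random, keeping at least one.

    features: dict mapping modality name -> list of floats (one embedding).
    p_drop and the keep-at-least-one rule are illustrative assumptions.
    """
    if not training:
        # At inference time every modality is passed through unchanged.
        return {name: list(vec) for name, vec in features.items()}
    names = list(features)
    kept = [name for name in names if rng.random() >= p_drop]
    if not kept:
        # Never drop every modality for a sample.
        kept = [rng.choice(names)]
    return {
        name: list(vec) if name in kept else [0.0] * len(vec)
        for name, vec in features.items()
    }

def fuse(features, order=("text", "audio", "vision", "motion")):
    """Concatenate modality vectors in a fixed order for the regressor input."""
    fused = []
    for name in order:
        fused.extend(features[name])
    return fused

# Toy embeddings standing in for the outputs of the stage-one encoders.
sample = {
    "text": [0.2, 0.1],
    "audio": [0.5, -0.3],
    "vision": [0.9, 0.4],
    "motion": [0.1, 0.7],
}
masked = modality_dropout(sample, p_drop=0.5, rng=random.Random(0))
regressor_input = fuse(masked)  # always length 8, whatever was dropped
```

Dropping whole modalities, rather than individual units, is what pushes the regressor to avoid leaning on any single encoder; whether the paper masks with zeros or with a learned placeholder is not stated in the material above.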

If this is right

The four-modality fusion model outperforms the three-modality version on the reported validation split.
Modality dropout during the fusion stage improves robustness across the six emotion dimensions.
Controlled adaptation of the pre-trained encoders during the second stage contributes to the achieved correlations.
The motion branch adds only small gains yet can still be studied for its interaction with the other modalities.
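The summary does not say how "controlled adaptation" is realized. One common reading, offered here purely as an assumption, is that the pre-trained encoders stay frozen for a few epochs of the fusion stage and are then updated with a much smaller learning rate than the regressor. The function name and every default value below are hypothetical.

```python
def stage_two_learning_rates(epoch, base_lr=1e-3, encoder_scale=0.1, freeze_epochs=2):
    """Return per-group learning rates for the fusion stage.

    Assumed policy (not confirmed by the source): encoders are frozen
    (learning rate 0) for the first `freeze_epochs` epochs, then trained
    at `encoder_scale` times the regressor's learning rate.
    """
    encoder_lr = 0.0 if epoch < freeze_epochs else base_lr * encoder_scale
    return {"regressor": base_lr, "encoders": encoder_lr}

# Schedule for the first four epochs of the fusion stage.
schedule = [stage_two_learning_rates(epoch) for epoch in range(4)]
```

In a deep-learning framework the two values would typically feed separate optimizer parameter groups; the sketch only shows the schedule itself.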

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged recipe could be tested on other continuous emotion or affect prediction tasks that use video.
If the independence assumption holds, the approach reduces the need for large amounts of synchronized multimodal training data.
Future work could measure whether removing the dropout step or the adaptation step lowers the Pearson scores by a comparable amount.

Load-bearing premise

Representations from separately trained modality encoders stay useful when combined by a simple regressor that drops modalities during training.

What would settle it

A single jointly trained end-to-end model that obtains a materially higher average Pearson correlation than 0.4722 on the identical validation split would indicate that the two-stage separation is not necessary.

Figures

Figures reproduced from arXiv: 2605.21869 by Dinithi Dissanayake, Ovindu Atukorala, Prasanth Sasikumar, Shaveen Silva, Suranga Nanayakkara.

**Figure 1.** Illustration of our two-stage multimodal framework for Emotional Mimicry Intensity (EMI) prediction. The framework combines …

Original abstract

We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior can be interesting to study. Our team was placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 for the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
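The headline metric can be made concrete with a short sketch. It assumes that "average Pearson correlation" means the unweighted mean of the per-dimension Pearson coefficients over the six emotions named in the abstract; the function names and the convention of scoring a constant column as 0.0 are illustrative choices, not taken from the paper.

```python
import math

EMOTIONS = ("Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy")

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumed convention: a constant column carries no correlation.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def mean_pearson(preds, targets):
    """Average the per-emotion Pearson scores.

    preds, targets: lists of rows, each row holding six values in
    EMOTIONS order (one row per video clip).
    """
    scores = [
        pearson([row[d] for row in preds], [row[d] for row in targets])
        for d in range(len(EMOTIONS))
    ]
    return sum(scores) / len(scores), dict(zip(EMOTIONS, scores))
```

Because Pearson correlation is invariant to the scale and offset of the predictions, a model can score well on this metric while being poorly calibrated in absolute intensity, which is worth keeping in mind when reading the 0.4722 and 0.57 figures.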

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward challenge submission that hits third place with a 0.57 test correlation using routine two-stage multimodal fusion, but the lack of ablations leaves the claimed benefit of the staging untested.

The main thing to know is that this paper describes a practical entry to the Hume-ABAW10 EMI Challenge. The authors train separate encoders for text, audio, vision, and optional motion, then fuse the outputs in a lightweight regressor that includes modality dropout and limited adaptation. Their best validation run reaches 0.4722 average Pearson correlation under a 4:1 split, and the test set score of 0.57 lands them third overall. They also note that the motion branch adds only marginal value.

The work supplies a clear, reproducible baseline for predicting continuous emotion intensities from in-the-wild video clips, which is useful for anyone building affective systems or entering similar benchmarks. The numbers are stated plainly and the pipeline is easy to follow from the description.

The soft spot is exactly the one the stress-test flags: the authors present the two-stage procedure as central to the result, yet no ablation compares it against end-to-end joint training or against freezing the same backbones and training only the regressor. Without those controls it is difficult to know whether the correlations come from the staging or simply from the choice of pre-trained models. The paper is honest about the small motion gain and does not overclaim theoretical advances, but the central performance attribution rests on an untested premise.

This kind of submission is mainly for researchers working on multimodal emotion estimation or challenge baselines in affective computing. It gives a usable reference point rather than a new mechanism or theoretical insight. I would send it to peer review in a challenge or applications venue because the empirical result is concrete and the setup is transparent enough for others to replicate or extend.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a two-stage multimodal framework for the Emotional Mimicry Intensity (EMI) Challenge, involving independent training of modality-specific encoders (text, audio, vision, and optional motion) followed by fusion via a lightweight regressor with modality dropout and controlled adaptation. The authors report an average Pearson correlation of 0.4722 on the expanded 4:1 validation split for the text-audio-vision-motion model and a test set correlation of 0.57, placing third in the challenge.

Significance. If the performance gains can be attributed to the proposed two-stage training, the work supplies a practical, reproducible baseline for multimodal emotion intensity prediction from in-the-wild videos. The approach leverages pre-trained models effectively and could serve as a starting point for future EMI-related research, but the lack of comparative experiments reduces its ability to demonstrate the specific value of the staging mechanism.

major comments (1)

[Abstract and framework description] The headline results (0.4722 validation Pearson correlation for the full model and 0.57 on test) are presented as outcomes of the two-stage independent encoder training plus fusion. No ablation studies are reported that isolate the contribution of the independent pre-training stage versus end-to-end joint training or versus freezing the pre-trained encoders and training only the regressor. This makes it impossible to determine whether the two-stage mechanism is responsible for the observed performance or if the results stem primarily from the choice of backbone models.

minor comments (2)

[Experimental setup] Details on the exact data splits, including how the 'expanded 4:1 split' was constructed, and any hyperparameter tuning procedures would improve reproducibility.
[Results] Inclusion of baseline comparisons or single-modality performances would help contextualize the multimodal fusion gains.
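The reproducibility detail requested in the first minor comment could be as small as a seeded split routine. The sketch below assumes the simplest possible construction, a deterministic shuffle of pooled clip identifiers cut at 80/20; the paper's actual procedure for the "expanded 4:1 split" is not described in the material above, and the function name and seed are hypothetical.

```python
import random

def split_four_to_one(clip_ids, seed=0):
    """Split clip identifiers into train/validation at a 4:1 ratio.

    Sorting before shuffling makes the result independent of input order,
    so the same seed always yields the same split.
    """
    ids = sorted(clip_ids)
    random.Random(seed).shuffle(ids)
    cut = (len(ids) * 4) // 5
    return ids[:cut], ids[cut:]

# 100 placeholder clip ids -> 80 for training, 20 for validation.
train_ids, val_ids = split_four_to_one(range(100), seed=0)
```

If clips from the same speaker or source video can land on both sides of the cut, a grouped split would be the safer choice; the sketch does not handle that case.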

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the two-stage multimodal framework for the EMI Challenge. We address the major comment point by point below.

Referee: [Abstract and framework description] The headline results (0.4722 validation Pearson correlation for the full model and 0.57 on test) are presented as outcomes of the two-stage independent encoder training plus fusion. No ablation studies are reported that isolate the contribution of the independent pre-training stage versus end-to-end joint training or versus freezing the pre-trained encoders and training only the regressor. This makes it impossible to determine whether the two-stage mechanism is responsible for the observed performance or if the results stem primarily from the choice of backbone models.

Authors: We agree that the manuscript does not report ablation studies isolating the independent pre-training stage from end-to-end joint training or from training only the regressor with frozen encoders. Our submission focused on providing a practical, reproducible baseline for the challenge that combines pre-trained modality encoders with a lightweight fusion regressor and modality dropout. The two-stage procedure was chosen to allow independent optimization of each modality before controlled adaptation during fusion. We acknowledge that direct comparisons would more clearly attribute performance gains to the staging mechanism rather than backbone selection alone. In the revised manuscript we will add these ablation experiments.
Revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical performance reported on held-out challenge data

The paper presents a two-stage multimodal training procedure (independent modality encoders followed by fusion regressor) and reports average Pearson correlations (0.4722 validation, 0.57 test) on the EMI challenge splits. No equations, derivations, or parameter fittings are described that reduce the reported metrics to quantities fitted inside the same model or to self-citations by construction. The results are measured on externally held-out test data, the framework is offered as a reproducible baseline, and no uniqueness theorems, ansatzes, or renamings of known results are invoked in a load-bearing manner. The central claim therefore remains self-contained empirical evaluation rather than tautological reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard assumption that pre-trained or independently trained modality encoders capture emotion-relevant features and that simple fusion plus dropout improves regression without introducing harmful interference; no new entities or free parameters are introduced in the abstract.

axioms (1)

domain assumption: Independently trained modality encoders produce representations that remain complementary when fused by a lightweight regressor.
Invoked by the description of the two-stage training procedure and the claim that this yields the best validation score.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

