Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Hanna Jang; Hyunseo Kim; Junghyun Lee; Junhyug Noh

arxiv: 2605.21417 · v1 · pith:UT3ABL43new · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-21 05:16 UTC · model grok-4.3

keywords: blended emotion recognition · multi-encoder fusion · rank-aware selection · attention-based gating · multimodal emotion · domain adaptation · video audio fusion

The pith

A rank-aware gating module selects and fuses only the top-n encoders per sample to better recognize blended emotions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Blended emotions appear as overlapping mixtures of video and audio cues rather than one clear signal, so single encoders or simple averaging of many encoders often miss the right combination. The framework first maps outputs from several pre-trained encoders into one shared space, then applies an attention gate that scores each encoder's usefulness for the current sample. It keeps only the highest-scoring encoders, predicts emotion presence and intensity separately, and merges those predictions at the probability level. Unsupervised feature alignment adds robustness when test data differs from training data. On the BlEmoRE benchmark this selective approach beat both single-encoder baselines and full multi-encoder fusion, finishing second in the competition.

Core claim

The paper shows that ordering encoders by sample-specific importance and fusing only the top-ranked subset produces more accurate fine-grained blended emotion labels than either any individual encoder or any non-selective combination of the same encoders.

What carries the argument

An attention-based gating module that computes sample-wise importance scores for each encoder and restricts fusion to the top-n highest-scoring ones.
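
The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `top_n_fusion`, the softmax renormalisation over the kept encoders, and the plain weighted sum (which keeps the input dimension rather than producing the 512-d representation mentioned in Figure 1) are all hypothetical choices. The gate scores are taken as given; in the paper they come from a learned attention module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_fusion(embeddings, gate_scores, n):
    """Fuse the n highest-scoring encoder embeddings for one sample.

    embeddings:  list of K equal-length vectors, assumed to be already
                 projected into a shared space.
    gate_scores: list of K raw importance scores (hypothetical gate output).
    Returns (fused_vector, kept_indices).
    """
    if not 1 <= n <= len(embeddings):
        raise ValueError("n must be between 1 and the number of encoders")
    # Rank encoders by score, highest first, and keep the top n.
    order = sorted(range(len(gate_scores)), key=lambda k: gate_scores[k], reverse=True)
    kept = order[:n]
    # Assumption: weights are renormalised over the kept encoders only.
    weights = softmax([gate_scores[k] for k in kept])
    dim = len(embeddings[0])
    fused = [sum(w * embeddings[k][d] for w, k in zip(weights, kept)) for d in range(dim)]
    return fused, kept
```

Because ranking happens per sample, two inputs can end up fusing different encoder subsets even though the encoder pool is the same.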

If this is right

Using only the top-n encoders outperforms both single-encoder baselines and naive fusion of all encoders.
Separating presence and salience prediction heads followed by probability-level fusion improves modeling of mixed emotions.
Feature-level unsupervised domain adaptation increases robustness to distribution shifts without requiring pseudo-labels.
The full pipeline placed second in the BlEmoRE challenge, showing practical gains on real multimodal emotion data.
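
The second point above, probability-level fusion of a presence head and a salience head, could look roughly like the sketch below. The abstract does not state the fusion rule, so the convex blend, the mixing weight `alpha`, and the choice of sigmoid for presence and softmax for salience are assumptions made purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_presence_salience(presence_logits, salience_logits, alpha=0.5):
    """Blend two heads at the probability level (hypothetical rule).

    presence_logits: one logit per emotion (is the emotion present?).
    salience_logits: one logit per emotion (how dominant is it?).
    alpha:           assumed mixing weight, not a value from the paper.
    """
    presence = [sigmoid(z) for z in presence_logits]  # independent per-emotion probabilities
    salience = softmax(salience_logits)               # relative shares that sum to 1
    return [alpha * p + (1.0 - alpha) * s for p, s in zip(presence, salience)]
```

The point of fusing probabilities rather than logits is that both heads are first mapped onto a comparable [0, 1] scale, so neither head dominates simply because its raw outputs are larger.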

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same top-n selection logic could reduce compute in other multimodal settings where some encoders are redundant for certain inputs.
Replacing the fixed n with a learned threshold might further improve results by adapting the number of kept encoders automatically.
The approach highlights that encoder ordering, not just which encoders are available, is a key design choice for blended-signal tasks.
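
The threshold idea in the second extension can be made concrete with a small sketch. Here `tau` is a fixed number for readability; the extension would learn it. The function name and the fallback to the single best encoder are hypothetical choices, not something described in the paper.

```python
import math


def select_by_threshold(gate_scores, tau):
    """Keep every encoder whose softmax weight is at least tau.

    Unlike fixed top-n selection, the number of kept encoders varies
    per sample. If nothing clears the threshold, the single
    highest-weighted encoder is kept so the fused output is never empty.
    """
    m = max(gate_scores)
    exps = [math.exp(s - m) for s in gate_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    kept = [k for k, w in enumerate(weights) if w >= tau]
    if not kept:
        kept = [max(range(len(weights)), key=lambda k: weights[k])]
    return kept, weights
```

With a flat score distribution this keeps many encoders, and with a peaked one it keeps few, which is the adaptive behaviour the extension is after.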

Load-bearing premise

The gating module's importance scores correctly identify which encoders are actually useful for a given sample so that discarding the rest improves the final output.

What would settle it

Train the same model twice on the BlEmoRE data, once with the gating and top-n selection active and once with all encoders always fused, then compare their accuracy on the official test set; a clear drop when selection is removed would support the claim.

Figures

Figures reproduced from arXiv: 2605.21417 by Hanna Jang, Hyunseo Kim, Junghyun Lee, Junhyug Noh.

**Figure 1.** Overview of the proposed framework. Heterogeneous encoder features are first projected into a shared 256-d embedding space. An attention-based gating module estimates sample-wise encoder importance, after which only the top-n encoders are retained for weighted fusion into a 512-d shared representation. Two prediction heads model emotion presence and salience, and their outputs are aligned through probabili…

**Figure 2.** Effect of the number of selected encoders

**Figure 3.** Distribution of modality-group importance scores across samples.

**Figure 4.** Top-n selection frequency for each encoder. A small subset of encoders is selected in most samples, while many others are used much less frequently. The gradually decaying distribution indicates that encoder usefulness is highly uneven, supporting the need for ranking-based selective fusion

**Figure 5.** Mean encoder importance across folds. High-importance encoders remain consistently dominant across different folds, while low-importance

**Figure 6.** Distribution of importance weights for representative encoders.

**Figure 7.** Pairwise Linear CKA similarity between projected encoder
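
For readers unfamiliar with the similarity measure named in Figure 7, linear CKA between two feature matrices X and Y (rows are samples, columns are features, columns mean-centred) is commonly written as ||XᵀY||²_F / (||XᵀX||_F · ||YᵀY||_F). The sketch below implements that standard formula in plain Python. It is not taken from the paper's code, and the helper names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of features computed on the same samples.

    Raises ZeroDivisionError if either input has no variance.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator
```

A value near 1 means two encoders carry largely the same information after projection, while a value near 0 means their representations are close to unrelated, which is when fusing both is most likely to help.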

Original abstract

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and na\"ive multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's system, which fuses encoders selectively via attention-based ranking, placed 2nd in the BlEmoRE challenge. It also adds decoupled presence/salience heads and unsupervised domain adaptation, but it lacks ablations showing that the ranking step drives the gains.

The main takeaway is that this framework selectively fuses top-n encoders per sample using an attention gate, decouples presence and salience predictions, and adds feature-level unsupervised domain adaptation. It beat single encoders and naive fusion on the challenge data and landed second overall. That competition placement gives the empirical claim some weight without needing to overclaim broader impact.

Referee Report

2 major / 2 minor

Summary. The paper proposes a rank-aware multi-encoder framework for blended emotion recognition that projects heterogeneous video and audio encoder features into a shared latent space, uses an attention-based gating module to estimate per-sample encoder importance, fuses only the top-n encoders, decouples prediction into presence and salience heads aligned via probability-level fusion, and adds feature-level unsupervised domain adaptation. Experiments on the BlEmoRE challenge reportedly outperform individual encoders and naive multi-encoder baselines, with the final system placing 2nd in the competition.

Significance. If the selective fusion mechanism can be shown to drive the gains independently of other architectural choices, the work would offer a practical advance in handling fine-grained multimodal blended emotions by demonstrating that encoder ordering and selection matter. The competition ranking supplies external corroboration, but the absence of detailed experimental controls limits the strength of the central claim.

major comments (2)

[Experiments] Experiments section: the headline attribution of gains to rank-aware selective fusion requires an ablation that isolates the attention-based ranking step (e.g., learned top-n selection versus random selection of n encoders, versus fixed encoder subset, versus full fusion) while holding the decoupled heads and domain adaptation fixed. No such control is described, so it remains possible that improvements arise from the other components rather than the rank-aware mechanism.
[Abstract / Experiments] Abstract and Experiments: the central performance claim is presented without reported error bars, statistical significance tests, data-split details, or cross-validation procedure, leaving the outperformance and 2nd-place ranking without visible verification steps.
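
The control requested in the first major comment amounts to swapping only the selection rule while everything downstream stays fixed. A minimal sketch of the four selection arms is given below; the strategy names and the function signature are hypothetical, and the downstream heads and training loop are deliberately left out.

```python
import random


def select_encoders(strategy, gate_scores, n, rng=None, fixed_subset=None):
    """Return sorted indices of the encoders to fuse under one ablation arm.

    strategy:     "learned_top_n", "random_n", "fixed_subset", or "full".
    gate_scores:  per-encoder importance scores for the current sample.
    rng:          a random.Random instance, required for "random_n".
    fixed_subset: indices chosen once in advance, required for "fixed_subset".
    """
    k = len(gate_scores)
    if strategy == "learned_top_n":
        order = sorted(range(k), key=lambda i: gate_scores[i], reverse=True)
        return sorted(order[:n])
    if strategy == "random_n":
        # Same budget of n encoders, but chosen without looking at the scores.
        return sorted(rng.sample(range(k), n))
    if strategy == "fixed_subset":
        return sorted(fixed_subset)
    if strategy == "full":
        return list(range(k))
    raise ValueError(f"unknown strategy: {strategy}")
```

If the learned arm does not beat the random arm at the same n, the gains are more plausibly due to sparsity or to the other components than to the ranking itself.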

minor comments (2)

[Abstract] Abstract: the phrase “naïve multi-encoder fusion baselines” should be spelled consistently (currently rendered with escaped quote).
[Method] Method: clarify the exact value of n used for top-n selection and whether it is fixed or learned per sample.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address the major comments point by point below, agreeing where revisions are needed to strengthen the claims regarding the rank-aware selective fusion.

Referee: [Experiments] Experiments section: the headline attribution of gains to rank-aware selective fusion requires an ablation that isolates the attention-based ranking step (e.g., learned top-n selection versus random selection of n encoders, versus fixed encoder subset, versus full fusion) while holding the decoupled heads and domain adaptation fixed. No such control is described, so it remains possible that improvements arise from the other components rather than the rank-aware mechanism.

Authors: We agree that a more targeted ablation is necessary to isolate the effect of the rank-aware selection. Our current experiments compare against naive multi-encoder baselines and individual encoders, but do not include random selection or fixed subsets while keeping other components constant. In the revised manuscript, we will add these ablations to demonstrate that the learned ranking and selective fusion contribute to the performance gains independently. revision: yes
Referee: [Abstract / Experiments] Abstract and Experiments: the central performance claim is presented without reported error bars, statistical significance tests, data-split details, or cross-validation procedure, leaving the outperformance and 2nd-place ranking without visible verification steps.

Authors: We acknowledge that the current version lacks error bars, statistical significance testing, and explicit details on data splits and cross-validation. The 2nd place ranking in the BlEmoRE challenge provides external validation, but to address this, we will include error bars from multiple runs with different seeds, perform paired t-tests or similar for significance, and clarify the data-split procedure in the experiments section of the revised paper. revision: yes
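
The verification steps promised in this response can be sketched with the standard library alone. The helpers below compute a mean and sample standard deviation across seeds and a paired t statistic for two systems evaluated on the same seeds or folds. The function names are hypothetical, and converting the statistic to a p-value (for example with a t-distribution table at the returned degrees of freedom) is left out.

```python
import math
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation of per-seed scores (needs >= 2 values)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched runs.

    scores_a[i] and scores_b[i] must come from the same seed or fold.
    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need the same number of runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1
```

Pairing matters here because seed-to-seed variance is often larger than the gap between methods, so an unpaired comparison can hide a consistent improvement.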

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical competition ranking

The paper's method consists of standard architectural components (feature projection, attention gating for per-sample importance, top-n selection, decoupled presence/salience heads, and unsupervised domain adaptation) whose effectiveness is asserted via outperformance on the BlEmoRE challenge and a 2nd-place ranking. No equations, fitted-parameter renamings, or self-citation chains are present that would reduce any prediction or uniqueness claim to the inputs by construction. The central result is externally falsifiable via the public competition leaderboard and does not rely on internal self-definition or load-bearing prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method implicitly assumes that heterogeneous encoder features can be meaningfully projected into a shared latent space and that top-n selection via attention improves fusion.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders."

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We formulate blended emotion recognition as a selective fusion problem, where encoder contributions are ranked dynamically rather than treated uniformly."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
