pith. machine review for the scientific record.

arxiv: 2605.06245 · v1 · submitted 2026-05-07 · 💻 cs.MM

Recognition: unknown

Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:06 UTC · model grok-4.3

classification 💻 cs.MM
keywords multimodal emotion recognition · contrastive learning · uncertainty regularization · modality combinations · heterogeneous modalities · representation consistency · emotion classification · multimodal fusion

The pith

The MCUR framework uses modality-aware contrastive learning and uncertainty regularization to achieve robust emotion recognition across varying modality combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that a framework combining contrastive learning on samples with the same emotion and same available modalities, together with uncertainty-guided sample weighting, enables more consistent and reliable emotion predictions from multimodal inputs despite discrepancies in semantics, quality, and availability. This would matter if true because real-world applications in human-computer interaction frequently encounter incomplete or heterogeneous modality data, where standard methods falter. A sympathetic reader would see it as a way to make emotion understanding more dependable without requiring all modalities to be present at all times. The authors support this with experimental results showing performance gains on three widely used datasets.

Core claim

By introducing Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL) to bring representations of same-category samples with identical modality sets closer together, and Sample-wise Uncertainty-Guided Regularization (SUGR) to adaptively weight samples during optimization, the MCUR framework delivers improved emotion recognition performance across heterogeneous modality combinations, with reported average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP compared to prior methods.

What carries the argument

Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL) for representation alignment and Sample-wise Uncertainty-Guided Regularization (SUGR) for adaptive training weights within the overall Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework.
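
To make the positive-pair construction concrete, here is an editorial sketch of a supervised-contrastive loss in which a pair counts as positive only when both the emotion label and the available-modality mask match. It is not the paper's code; the tensor shapes, mask encoding, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def mcb_cl_sketch(embeddings, labels, modality_masks, temperature=0.1):
    """Illustrative MCB-CL-style loss: positives share BOTH the emotion label
    and the set of available modalities (assumed encoding, not the paper's exact loss).

    embeddings:     (B, D) fused sample representations
    labels:         (B,)   integer emotion categories
    modality_masks: (B, M) binary availability indicators, e.g. [text, audio, video]
    """
    z = F.normalize(embeddings, dim=1)              # work in cosine-similarity space
    sim = z @ z.t() / temperature                   # (B, B) pairwise similarities

    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    same_combo = (modality_masks.unsqueeze(0) == modality_masks.unsqueeze(1)).all(-1)
    positives = (same_label & same_combo).float()
    positives.fill_diagonal_(0)                     # a sample is not its own positive

    # SupCon-style denominator over all other samples in the batch
    off_diag = 1.0 - torch.eye(len(z), device=z.device)
    log_prob = sim - torch.log((torch.exp(sim) * off_diag).sum(dim=1, keepdim=True) + 1e-12)

    pos_count = positives.sum(dim=1).clamp(min=1)   # samples with no positives contribute 0
    return (-(positives * log_prob).sum(dim=1) / pos_count).mean()
```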

If this is right

  • Models can handle cases where some modalities, such as audio or video, are missing without a significant performance drop.
  • Representation space becomes more structured by grouping according to both emotion category and modality availability.
  • Training focuses more on reliable samples, reducing the impact of noisy or uncertain data points (a weighting sketch follows this list).
  • The approach provides a general way to address modality heterogeneity in multimodal tasks.
  • Consistent improvements across different benchmark datasets suggest broad applicability in emotion recognition.
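
The weighting mechanism the third bullet refers to is sketched below under an assumed heteroscedastic form: the model predicts a per-sample log-variance and the task loss is scaled by the corresponding precision, with a penalty that keeps it from declaring every sample uncertain. The paper's SUGR may use a different parameterization; this is only one common realization.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Illustrative sample-wise uncertainty weighting (assumed form, not the paper's SUGR)."""

    def __init__(self):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(reduction="none")

    def forward(self, logits, log_var, targets):
        # logits: (B, C) class scores; log_var: (B,) predicted log sigma^2; targets: (B,)
        per_sample = self.ce(logits, targets)        # unweighted per-sample loss
        precision = torch.exp(-log_var)              # confident samples get full weight
        return (0.5 * precision * per_sample + 0.5 * log_var).mean()
```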

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Techniques like MCB-CL could extend to other multimodal domains such as visual question answering or speech translation where input combinations vary.
  • Combining this with modality imputation methods might further enhance performance in highly sparse settings.
  • Testing on datasets with real-time streaming modalities could reveal if the uncertainty regularization helps with temporal inconsistencies.
  • The contrastive mechanism might inspire similar consistency losses in non-emotion multimodal classification problems.

Load-bearing premise

The gains from the modality-combination contrastive loss and uncertainty regularization will hold up on data with modality availability patterns different from those in the training and test sets used.

What would settle it

A controlled test on a newly collected multimodal emotion dataset featuring modality combinations or quality distributions not seen in MOSI, MOSEI, or IEMOCAP; if the proposed method showed no F1 improvement there, or underperformed baselines, the claim would not hold.

Figures

Figures reproduced from arXiv: 2605.06245 by Fuji Ren, Jiawen Deng, Minhao Liu, Yanru Zhang, Yan Zhuang.

Figure 1: Illustration of the representation inconsistency challenges. (a) Intra-combination inconsis…
Figure 2: The structure of the MCUR framework. MCUR includes a teacher model, a student model…
Figure 3: Uncertainty estimation under different missing modality conditions. (a) NLL under different…
Figure 4: The structure of the teacher model.
Figure 5: Analysis on the training convergence on IEMOCAP. (a)…
Figure 6: Visualization of MCUR and its variants on IEMOCAP dataset with MR=0.7. (a) w/o…
read the original abstract

Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns sample-wise uncertain weights to samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework for multimodal emotion recognition (MER). It addresses heterogeneous modality combinations via two components: Modality Combination-Based and Category-Based Contrastive Learning (MCB-CL), which pulls same-emotion samples with identical available modalities closer in embedding space, and Sample-wise Uncertainty-Guided Regularization (SUGR), which applies adaptive per-sample uncertainty weights during training. The central claim is that MCUR yields robust emotion prediction across varying modality availabilities and outperforms prior methods, with reported average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.

Significance. If the reported gains prove robust to modality distribution shifts and are not attributable to dataset-specific correlations, the framework could meaningfully improve the reliability of MER systems in real-world settings with missing or variable modalities. The empirical focus on representation consistency is a reasonable direction, but the absence of detailed ablations, error bars, or controls for modality availability patterns limits assessment of whether the gains reflect genuine cross-modal generalization.

major comments (3)
  1. [Abstract and Experiments] Abstract and §4 (Experiments): The headline F1 gains (2.2–4.37 %) are presented without error bars, statistical significance tests, or per-run variance, making it impossible to determine whether the improvements exceed noise or dataset-specific effects. This directly undermines the claim of consistent outperformance.
  2. [§3.2] §3.2 (MCB-CL): The contrastive loss pulls embeddings of same-emotion samples that share identical modality combinations. Because modality presence in the standard MOSI/MOSEI/IEMOCAP splits follows collection/annotation patterns rather than being randomized, the loss can minimize distance by encoding modality-combination identity instead of emotion semantics. No ablation or test-time evaluation on held-out or randomized modality combinations is described to rule out this shortcut.
  3. [§3.3 and Experiments] §3.3 (SUGR) and overall evaluation: Sample-wise uncertainty weights are learned jointly with the contrastive objective. This can amplify any modality-signature bias by down-weighting samples whose modality pattern deviates from the training distribution, yet no diagnostic (e.g., correlation between learned uncertainty and modality combination) or cross-dataset generalization test is reported.
minor comments (2)
  1. [§3.2] Notation for modality availability indicators and the precise form of the contrastive loss (positive/negative pair construction) should be formalized with equations rather than prose only.
  2. [Abstract] The abstract states 'extensive experiments' but provides no reference to supplementary material containing full hyper-parameter tables, ablation studies, or code.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thorough review and valuable suggestions. We appreciate the opportunity to clarify and strengthen our work. Below, we respond to each major comment, outlining the revisions we will make to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and §4 (Experiments): The headline F1 gains (2.2–4.37 %) are presented without error bars, statistical significance tests, or per-run variance, making it impossible to determine whether the improvements exceed noise or dataset-specific effects. This directly undermines the claim of consistent outperformance.

    Authors: We fully agree that the absence of error bars and statistical tests weakens the claims. In the revised manuscript, we will rerun the experiments multiple times (at least 5 runs per setting) with different random seeds, report the mean and standard deviation of F1 scores, and include p-values from appropriate statistical tests to demonstrate that the gains are significant. revision: yes
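
A minimal sketch of the reporting this response commits to: mean ± standard deviation over seeds plus a paired t-test against a baseline run with the same seeds and splits. The function and its inputs are placeholders; the per-seed F1 arrays would come from the promised reruns.

```python
import numpy as np
from scipy import stats

def summarize_runs(mcur_f1, baseline_f1):
    """Report mean ± std over seeds and a paired significance test (placeholder inputs)."""
    mcur_f1, baseline_f1 = np.asarray(mcur_f1), np.asarray(baseline_f1)
    print(f"MCUR:     {mcur_f1.mean():.3f} ± {mcur_f1.std(ddof=1):.3f}")
    print(f"Baseline: {baseline_f1.mean():.3f} ± {baseline_f1.std(ddof=1):.3f}")
    t_stat, p_value = stats.ttest_rel(mcur_f1, baseline_f1)   # paired across seeds
    print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```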

  2. Referee: [§3.2] §3.2 (MCB-CL): The contrastive loss pulls embeddings of same-emotion samples that share identical modality combinations. Because modality presence in the standard MOSI/MOSEI/IEMOCAP splits follows collection/annotation patterns rather than being randomized, the loss can minimize distance by encoding modality-combination identity instead of emotion semantics. No ablation or test-time evaluation on held-out or randomized modality combinations is described to rule out this shortcut.

    Authors: The referee correctly identifies a potential issue with shortcut learning in MCB-CL. While the design combines modality-combination and category-based contrastive terms to focus on emotion semantics, we did not explicitly test for this. We will add ablations in the revision, including training and evaluation on datasets with randomized modality availability patterns and held-out combinations, to confirm that performance gains are due to emotion understanding rather than modality identity. revision: yes
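
One way to run the randomized-mask evaluation described above, sketched under assumptions: availability patterns are drawn uniformly over non-empty modality subsets at test time, so a model that merely encodes combination identity loses its advantage. `model(features, mask)` is a placeholder interface; the real system would zero out or impute the masked modalities before fusion.

```python
import itertools
import random
import torch

MODALITIES = ("text", "audio", "video")   # assumed modality set

def random_mask():
    """Draw an availability mask uniformly over the non-empty modality subsets."""
    combos = [c for c in itertools.product((0, 1), repeat=len(MODALITIES)) if any(c)]
    return torch.tensor(random.choice(combos), dtype=torch.float)

@torch.no_grad()
def evaluate_randomized_masks(model, loader, n_trials=5):
    """Average accuracy over several random availability patterns per batch."""
    correct, total = 0, 0
    for features, labels in loader:
        for _ in range(n_trials):
            preds = model(features, random_mask()).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```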

  3. Referee: [§3.3 and Experiments] §3.3 (SUGR) and overall evaluation: Sample-wise uncertainty weights are learned jointly with the contrastive objective. This can amplify any modality-signature bias by down-weighting samples whose modality pattern deviates from the training distribution, yet no diagnostic (e.g., correlation between learned uncertainty and modality combination) or cross-dataset generalization test is reported.

    Authors: This point is well-taken regarding possible bias amplification in SUGR. We will incorporate a new analysis in the revised paper that examines the relationship between the learned sample-wise uncertainty weights and the modality combinations present in the samples. Furthermore, we will include cross-dataset evaluation results to assess generalization beyond the training modality distributions. revision: yes
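
A sketch of the promised diagnostic: group the learned per-sample uncertainty weights by modality combination and report how much of their variance the grouping explains (eta-squared). Variable names are placeholders for quantities logged during training; a value near zero would suggest the weights do not simply track modality signatures.

```python
import numpy as np

def uncertainty_by_combination(uncertainties, modality_masks):
    """Per-combination mean uncertainty and eta-squared (between-group / total variance).

    uncertainties:  (N,)   learned per-sample weights (placeholder input)
    modality_masks: (N, M) binary availability indicators per sample
    """
    uncertainties = np.asarray(uncertainties, dtype=float)
    groups = {}
    for mask, u in zip(np.asarray(modality_masks, dtype=int), uncertainties):
        groups.setdefault(tuple(mask), []).append(u)

    grand_mean = uncertainties.mean()
    ss_between = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values())
    ss_total = ((uncertainties - grand_mean) ** 2).sum()
    eta_sq = float(ss_between / ss_total) if ss_total > 0 else 0.0

    means = {combo: float(np.mean(v)) for combo, v in groups.items()}
    return means, eta_sq
```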

Circularity Check

0 steps flagged

No circularity: empirical framework with no self-referential derivations

full rationale

The paper presents an empirical ML framework (MCUR) consisting of a modality-combination contrastive loss (MCB-CL) and sample-wise uncertainty regularization (SUGR), evaluated via standard train/test splits on MOSI, MOSEI, and IEMOCAP. No equations, derivations, or first-principles predictions are offered that reduce to fitted parameters or self-citations by construction. Performance claims are purely experimental F1 deltas; the method does not invoke uniqueness theorems, rename known results, or smuggle ansatzes via self-citation. The claims therefore rest on comparisons against external benchmarks rather than on any self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented physical entities are introduced; the work is an empirical machine-learning framework whose assumptions are implicit in the training procedure.

pith-pipeline@v0.9.0 · 5502 in / 1089 out tokens · 46819 ms · 2026-05-08T03:06:20.644717+00:00 · methodology

discussion (0)

