Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Cam-Van Thi Nguyen; Duc-Trong Le; Phuong-Anh Nguyen; The-Son Le

arxiv: 2605.21565 · v1 · pith:PEQHUR5Gnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-22 09:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-paced curriculum learning · multimodal emotion recognition · modality imbalance · conversational emotion recognition · difficulty measurer · IEMOCAP · MELD · plug-and-play framework

The pith

Self-paced curriculum learning with dual-level difficulty scoring reduces modality imbalance in conversational emotion recognition and improves weighted F1 by about 1.2% to 6.6% on IEMOCAP and by up to 10.4% on MELD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a plug-and-play self-paced curriculum learning framework to address modality misalignment and imbalanced learning in multimodal emotion recognition for conversations. It introduces a dual-level difficulty measurer that scores fine-grained challenges at the utterance level for each modality and broader structures at the conversation level including emotional dependencies and coherence. A learning scheduler then orders training to move from easier to harder instances based on those scores. This integration into existing architectures aims to prevent any single modality from dominating and to produce more robust models. A reader would care because more balanced multimodal training could yield reliable emotion detection that makes fuller use of language, voice, and facial cues in dialogues.

Core claim

The paper claims that a dual-level difficulty measurer within self-paced curriculum learning, which computes utterance-level modality-specific difficulty scores and conversation-level scores capturing emotional dependencies and modality coherence, when paired with a scheduler that guides training from easy to hard instances, alleviates modality imbalance when plugged into existing multimodal emotion recognition architectures and produces higher weighted F1 scores on IEMOCAP and MELD.

What carries the argument

The dual-level Difficulty Measurer that produces utterance-level scores for modality-specific difficulty and conversation-level scores for dialogue structures, together with the Learning Scheduler that orders instances from easier to more difficult according to those scores.
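
As a rough illustration of this mechanism, the sketch below uses the combination and binary weighting quoted in the paper passage further down this page (ρ_ij = 2 s_i l_ij / (s_i + l_ij), with v_ij = 1 if ρ_ij ≤ λ). Reading s_i as the conversation-level score follows that passage; treating l_ij as the utterance-level score is an assumption, and the function names, the zero-denominator guard, and the toy values are hypothetical rather than the authors' implementation.

```python
from typing import List


def combined_difficulty(s_i: float, l_ij: float) -> float:
    """Harmonic-mean style score rho_ij = 2 * s_i * l_ij / (s_i + l_ij)."""
    denom = s_i + l_ij
    # Assumption: treat the degenerate case s_i + l_ij == 0 as zero difficulty.
    return 0.0 if denom == 0 else 2.0 * s_i * l_ij / denom


def hard_weights(rhos: List[float], lam: float) -> List[int]:
    """Binary self-paced weights: v_ij = 1 if rho_ij <= lambda, else 0."""
    return [1 if rho <= lam else 0 for rho in rhos]


# Toy values (not from the paper): one conversation score, three utterance scores.
s_i = 0.5
utterance_scores = [0.5, 0.25, 1.5]
rhos = [combined_difficulty(s_i, l_ij) for l_ij in utterance_scores]
weights = hard_weights(rhos, lam=0.6)
```

With a threshold of 0.6, only instances whose combined score stays at or below the threshold receive weight 1; the rest are held back until the threshold grows.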

If this is right

Existing multimodal emotion recognition architectures gain performance without requiring changes to their core design.
All modalities contribute more evenly to predictions rather than one dominating the learned representation.
Model robustness increases across varying modality combinations and base architectures.
Training dynamics stabilize by sequencing examples according to measured difficulty instead of random order.
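
The easy-to-hard sequencing can be pictured as a threshold that grows during training, so harder instances are admitted later. The sketch below assumes a linear growth rule; the paper's actual pacing schedule is not stated on this page, so `linear_pace`, its parameters, and the toy scores are hypothetical.

```python
from typing import List


def linear_pace(epoch: int, total_epochs: int, lam_start: float, lam_end: float) -> float:
    """Hypothetical pacing rule: grow the threshold linearly from lam_start to lam_end."""
    if total_epochs <= 1:
        return lam_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)


def admitted_indices(difficulties: List[float], lam: float) -> List[int]:
    """Indices of training instances whose difficulty is at or below the threshold."""
    return [i for i, d in enumerate(difficulties) if d <= lam]


# Toy difficulty scores (illustrative only).
difficulties = [0.2, 0.9, 0.5, 0.7]
schedule = [
    admitted_indices(difficulties, linear_pace(e, 3, lam_start=0.3, lam_end=1.0))
    for e in range(3)
]
```

Each entry of `schedule` lists the instances available in that epoch, so the training set expands from the easiest instance to the full set.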

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-level scoring idea could be tested on other multimodal sequence tasks where one input type tends to overshadow the others.
If the difficulty scores align with actual learning progress, they might support online adaptation in live dialogue applications.
Combining this scheduler with existing regularization methods could address imbalance in noisier, real-world conversation data.

Load-bearing premise

The dual-level Difficulty Measurer accurately captures utterance-level modality-specific difficulty and conversation-level structures including emotional dependencies and modality coherence in a manner that genuinely improves training dynamics.

What would settle it

Training the enhanced models on IEMOCAP or MELD with the difficulty measurer replaced by random instance ordering and finding that the reported performance gains no longer appear.
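
One way to set up this control, sketched under the assumption that the curriculum can be reduced to a per-instance ordering: keep the training loop fixed and swap only the ordering function. `curriculum_order` and `random_order` are hypothetical helper names, and the scores are toy values.

```python
import random
from typing import List, Sequence


def curriculum_order(difficulties: Sequence[float]) -> List[int]:
    """Easy-to-hard ordering: instance indices sorted by ascending difficulty."""
    return sorted(range(len(difficulties)), key=lambda i: difficulties[i])


def random_order(n: int, seed: int) -> List[int]:
    """Control condition: a seeded random permutation that ignores difficulty."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


# Toy scores (illustrative only).
difficulties = [0.8, 0.1, 0.5, 0.3]
easy_to_hard = curriculum_order(difficulties)
control = random_order(len(difficulties), seed=0)
```

Fixing the seed keeps the control reproducible, so any gap between the two conditions can be attributed to the ordering rather than to run-to-run noise in the shuffle.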

Original abstract

Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adapts self-paced curriculum learning with a dual-level difficulty measurer for multimodal emotion recognition but does not fully demonstrate that the gains stem from modality balance rather than generic curriculum ordering.

The key takeaway is that this work adds a self-paced curriculum learning module with a dual-level difficulty measurer to existing multimodal conversational emotion recognition models. It claims this helps with modality imbalance and reports weighted F1 improvements of about 1.2% to 6.6% on IEMOCAP and up to 10.4% on MELD. The reported numbers do show gains, but the design does not yet separate the modality-specific contribution from ordinary curriculum effects.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a plug-and-play Self-Paced Curriculum Learning (SPCL) framework for Multimodal Emotion Recognition in Conversations (MERC) to address modality misalignment and imbalance. It introduces a dual-level Difficulty Measurer that computes utterance-level modality-specific difficulty scores and conversation-level scores incorporating emotional dependencies and modality coherence; a Learning Scheduler then orders training instances from easier to harder. The authors integrate this module into existing MERC architectures and report weighted F1 improvements of approximately +1.2% to +6.6% on IEMOCAP and up to +10.4% on MELD across multiple models and modality settings.

Significance. If the reported gains can be shown to arise specifically from the modality-aware components of the Difficulty Measurer rather than generic curriculum ordering, the work would supply a lightweight, architecture-agnostic technique for improving robustness in multimodal conversational tasks. The plug-and-play design and consistent gains across datasets would be practically useful, though the current evidence does not yet isolate the modality-balance mechanism.

major comments (2)

[Abstract and §3] Abstract and §3 (dual-level Difficulty Measurer): the central claim that SPCL alleviates modality imbalance specifically is not supported by an ablation that disables the utterance-level modality-specific score while retaining the conversation-level score. Without this control, the observed F1 gains remain consistent with any self-paced scheduler and do not demonstrate a unique contribution to modality balance.
[§4] §4 (Experiments): no direct imbalance metric (e.g., per-modality accuracy variance or cross-modal loss disparity) is reported before versus after SPCL, and no statistical testing or error bars across multiple runs are provided. These omissions leave open the possibility that gains reflect generic curriculum effects or dataset-specific variance rather than the claimed modality-balance improvement.
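
The two imbalance metrics named in the second comment could be computed along the following lines. The report does not define them, so the definitions here (population variance of unimodal accuracies, and the max-minus-min gap of per-modality mean losses) are assumptions, as are the function names and the toy predictions.

```python
from statistics import pvariance
from typing import Dict, List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def per_modality_accuracy_variance(
    preds_by_modality: Dict[str, List[int]], labels: List[int]
) -> float:
    """Population variance of unimodal accuracies; 0 means all modalities score the same."""
    accs = [accuracy(p, labels) for p in preds_by_modality.values()]
    return pvariance(accs)


def loss_disparity(mean_loss_by_modality: Dict[str, float]) -> float:
    """Assumed definition: gap between the largest and smallest per-modality mean loss."""
    values = list(mean_loss_by_modality.values())
    return max(values) - min(values)


# Toy predictions (illustrative only, not results from the paper).
labels = [0, 1, 1, 0]
preds = {
    "text": [0, 1, 1, 0],
    "audio": [0, 1, 0, 0],
    "visual": [1, 1, 0, 0],
}
variance = per_modality_accuracy_variance(preds, labels)
```

Reporting both quantities before and after the curriculum is applied would show whether the modalities move closer together, independently of the overall F1 change.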

minor comments (2)

[Method] Method section: the precise formulation of the utterance-level modality-specific difficulty score (how language, visual, and audio difficulties are combined) is described at a high level but lacks explicit equations or pseudocode, hindering reproducibility.
[Abstract] Abstract: the range of improvements (+1.2% to +6.6% on IEMOCAP) should specify which baseline architectures and modality combinations produce the lower versus upper ends of the range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The suggestions help clarify the unique contributions of our dual-level Difficulty Measurer. We address each major comment below and will incorporate revisions to strengthen the evidence for modality-balance improvements.

Referee: [Abstract and §3] Abstract and §3 (dual-level Difficulty Measurer): the central claim that SPCL alleviates modality imbalance specifically is not supported by an ablation that disables the utterance-level modality-specific score while retaining the conversation-level score. Without this control, the observed F1 gains remain consistent with any self-paced scheduler and do not demonstrate a unique contribution to modality balance.

Authors: We agree that an ablation isolating the utterance-level modality-specific component is necessary to substantiate its role in addressing modality imbalance beyond generic self-paced ordering. In the revised manuscript, we will add this control experiment: we will train variants using only the conversation-level score (with emotional dependencies and modality coherence) and compare against the full dual-level SPCL. Performance differences on IEMOCAP and MELD will be reported to show the incremental benefit of the modality-specific utterance-level scoring.

revision: yes

Referee: [§4] §4 (Experiments): no direct imbalance metric (e.g., per-modality accuracy variance or cross-modal loss disparity) is reported before versus after SPCL, and no statistical testing or error bars across multiple runs are provided. These omissions leave open the possibility that gains reflect generic curriculum effects or dataset-specific variance rather than the claimed modality-balance improvement.

Authors: We acknowledge that direct metrics and statistical validation would more convincingly link gains to modality balance rather than generic curriculum effects. In the revised §4, we will report per-modality accuracy variance and cross-modal loss disparity before versus after SPCL application. We will also include error bars from multiple runs (at least 5 random seeds) and apply paired statistical significance tests (e.g., t-test) on the weighted F1 improvements to rule out dataset variance.

revision: yes
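
The paired test over seeds mentioned in this response can be sketched with the standard library alone. The snippet computes only the paired t statistic and its degrees of freedom; the per-seed scores are made-up placeholders, not values from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(baseline: Sequence[float], treated: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired per-seed scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with at least 2 seeds")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Placeholder weighted-F1 values for 5 seeds (made up for illustration).
baseline_f1 = [60.0, 61.0, 59.0, 60.0, 62.0]
with_curriculum_f1 = [62.0, 62.0, 61.0, 63.0, 63.0]
t_value, dof = paired_t_statistic(baseline_f1, with_curriculum_f1)
```

A p-value can then be read from a t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with the p-value.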

Circularity Check

0 steps flagged

SPCL framework introduced as additive plug-and-play module with empirical gains; no definitional reduction or self-referential derivation

The paper presents SPCL as an external curriculum technique integrated into existing MERC architectures via a newly defined dual-level Difficulty Measurer (utterance-level modality-specific scores plus conversation-level structure scores) and a Learning Scheduler. Claimed F1 improvements (+1.2% to +6.6% on IEMOCAP, up to +10.4% on MELD) are reported from experiments across architectures and datasets rather than derived by construction from fitted parameters or prior self-citations. No equations reduce the modality-balance alleviation to quantities defined within the paper's own inputs; the approach remains an empirical additive intervention whose central claims rest on observed performance deltas, not on self-definition or load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on standard machine learning assumptions plus the unverified effectiveness of the newly introduced dual-level difficulty scoring mechanism; no explicit free parameters or invented entities are detailed, and the only axiom recorded is the domain assumption below.

axioms (1)

domain assumption: Difficulty scores from the dual-level measurer reflect genuine learning challenges that benefit from curriculum ordering.
This premise underpins the entire learning scheduler and is invoked when claiming alleviation of modality imbalance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ρ_ij = 2 s_i l_ij / (s_i + l_ij) … hard regularizer g(ρ_ij, λ) that leads to a binary weighting v_ij = 1 if ρ_ij ≤ λ
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

conversation-level score s_i = σ(s^a_i, s^t_i, s^v_i) … modality misalignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2020
[63]

In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Hu, D., Hou, X., Wei, L., Jiang, L., Mo, Y.: Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7037–7041 (2022). IEEE

work page 2022
[64]

In: Proceedings of the 18th ACM International Conference on Multimedia, pp

Eyben, F., W¨ ollmer, M., Schuller, B.: Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462 (2010)

work page 2010
[65]

In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp

Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: Openface 2.0: Facial behav- ior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66 (2018). IEEE

work page 2018
[66]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pp. 3982–3992. Association for Computati...

work page doi:10.18653/v1/d19-1410 2019
[67]

Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.: Competence- based curriculum learning for neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies, Volume 1 (long and Short Papers), pp. 1162–1172 (2019)

work page 2019
[68]

Scientific Data7(1), 293 (2020) 34

Park, C.Y., Cha, N., Kang, S., Kim, A., Khandoker, A.H., Hadjileontiadis, L., Oh, A., Jeong, Y., Lee, U.: K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Scientific Data7(1), 293 (2020) 34

work page 2020

[1] Yuan, Y., Li, Z., Zhao, B.: A survey of multimodal learning: Methods, applications, and future. ACM Computing Surveys 57(7), 1–34 (2025)

[2] Baltrušaitis, T., Ahuja, C., Morency, L.-P.: Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2), 423–443 (2018)

[3] Liang, P.P., Lyu, Y., Fan, X., Wu, Z., Cheng, Y., Wu, J., Chen, L., Wu, P., Lee, M.A., Zhu, Y., et al.: Multibench: Multiscale benchmarks for multimodal representation learning. Advances in Neural Information Processing Systems 2021(DB1), 1 (2021)

[4] Liang, P.P., Zadeh, A., Morency, L.-P.: Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys 56(10), 1–42 (2024)

[5] Gladys, A.A., Vetriselvi, V.: Survey on multimodal approaches to emotion recognition. Neurocomputing 556, 126693 (2023)

[6] Ghosal, D., Majumder, N., Gelbukh, A., Mihalcea, R., Poria, S.: COSMIC: COmmonSense knowledge for eMotion identification in conversations. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2470–2481. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.224

[7] Hu, J., Liu, Y., Zhao, J., Jin, Q.: Mmgcn: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5666–5675 (2021)

[8] Nguyen, C.V.T., Mai, T., The, S., Kieu, D., Le, D.-T.: Conversation understanding using relational temporal graph neural networks with auxiliary cross-modality interaction. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 15154–15167. Association for Computational Lin... https://doi.org/10.18653/v1/2023.emnlp-main.937

[9] Peng, X., Wei, Y., Deng, A., Wang, D., Hu, D.: Balanced multimodal learning via on-the-fly gradient modulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8238–8247 (2022)

[10] Wang, W., Tran, D., Feiszli, M.: What makes training multi-modal classification networks hard? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12695–12705 (2020)

[11] Shi, Q., Ye, M., Huang, W., Du, B., Zong, X.: Gradient and structure consistency in multimodal emotion recognition. IEEE Transactions on Image Processing (2025)

[12] Wang, Y., Liu, M., Li, Z., Hu, Y., Luo, X., Nie, L.: Unlocking the power of multimodal learning for emotion recognition in conversation. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5947–5955 (2023)

[13] Zhang, Q., Wei, Y., Han, Z., Fu, H., Peng, X., Deng, C., Hu, Q., Xu, C., Wen, J., Hu, D., et al.: Multimodal fusion on low-quality data: A comprehensive survey. arXiv preprint arXiv:2404.18947 (2024)

[14] Du, C., Teng, J., Li, T., Liu, Y., Yuan, T., Wang, Y., Yuan, Y., Zhao, H.: On uni-modal feature learning in supervised multi-modal learning. In: International Conference on Machine Learning, pp. 8632–8656 (2023). PMLR

[15] Wu, N., Jastrzebski, S., Cho, K., Geras, K.J.: Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In: International Conference on Machine Learning, pp. 24043–24055 (2022). PMLR

[16] Xu, R., Feng, R., Zhang, S.-X., Hu, D.: Mmcosine: Multi-modal cosine loss towards balanced audio-visual fine-grained learning. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023). IEEE

[17] Zhou, Y., Wang, X., Chen, H., Duan, X., Zhu, W.: Intra- and inter-modal curriculum for multimodal learning. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 3724–3735 (2023)

[18] Fan, Y., Xu, W., Wang, H., Wang, J., Guo, S.: Pmr: Prototypical modal rebalance for multimodal learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20029–20038 (2023)

[19] Nguyen, D.A., Kamboj, A., Do, M.N.: Robult: Leveraging redundancy and modality-specific features for robust multimodal learning. In: IJCAI (2025)

[20] Ismail, A.A., Hasan, M., Ishtiaq, F.: Improving multimodal accuracy through modality pre-training and attention. arXiv preprint arXiv:2011.06102 (2020)

[21] Huang, C., Wei, Y., Yang, Z., Hu, D.: Adaptive unimodal regulation for balanced multimodal information acquisition. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25854–25863 (2025)

[22] Wu, Q., Shao, Y., Wang, J., Sun, X.: Learning optimal multimodal information bottleneck representations. In: Forty-second International Conference on Machine Learning (2025)

[23] Li, H., Li, X., Hu, P., Lei, Y., Li, C., Zhou, Y.: Boosting multi-modal model performance with adaptive gradient modulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22214–22224 (2023)

[24] Liu, F., Fu, Z., Wang, Y.: Reward-based gradient modulation for multimodal emotion recognition with lora. IEEE Transactions on Computational Social Systems (2025)

[25] Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 335–359 (2008)

[26] Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: Meld: A multimodal multi-party dataset for emotion recognition in conversations. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 527–536 (2019)

[27] Ghosal, D., Majumder, N., Poria, S., Chhaya, N., Gelbukh, A.: Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 154–164 (2019)

[28] Nguyen, C.-V.T., Nguyen, C.-B., Le, D.-T., Ha, Q.-T.: Curriculum learning meets directed acyclic graph for multimodal emotion recognition. In: Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024...

[29] Hu, D., Wei, L., Huai, X.: DialogueCRN: Contextual reasoning networks for emotion recognition in conversations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7042–7052. Association for Computational Lingui... https://doi.org/10.18653/v1/2021.acl-long.547

[30] Shen, W., Chen, J., Quan, X., Xie, Z.: Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13789–13797 (2021)

[31] Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., Morency, L.-P.: Context-dependent sentiment analysis in user-generated videos. In: Barzilay, R., Kan, M.-Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 873–883. Association for Computational Linguistics, Vancouver,...

[32] Hazarika, D., Poria, S., Zadeh, A., Cambria, E., Morency, L.-P., Zimmermann, R.: Conversational memory network for emotion recognition in dyadic dialogue videos. In: Walker, M., Ji, H., Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1...

[33] Association for Computational Linguistics, New Orleans, Louisiana (2018). https://doi.org/10.18653/v1/N18-1193

[34] Hazarika, D., Poria, S., Mihalcea, R., Cambria, E., Zimmermann, R.: ICON: Interactive conversational memory network for multimodal emotion detection. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2594–

[35] Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1280

[36] Li, B., Fei, H., Liao, L., Zhao, Y., Teng, C., Chua, T.-S., Ji, D., Li, F.: Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5923–5934 (2023)

[37] Shi, T., Huang, S.-L.: MultiEMO: An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14752–14766. Association for Computational Linguistics, Toronto, Canada (2023). https://d...

[38] Delbrouck, J.-B., Tits, N., Brousmiche, M., Dupont, S.: A transformer-based joint-encoding for emotion recognition and sentiment analysis. In: Zadeh, A., Morency, L.-P., Liang, P.P., Poria, S. (eds.) Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), pp. 1–7. Association for Computational Linguistics, Seattle, USA (2020). https://...

[39] Tu, G., Xie, T., Liang, B., Wang, H., Xu, R.: Adaptive graph learning for multimodal conversational emotion detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19089–19097 (2024)

[40] Joshi, A., Bhat, A., Jain, A., Singh, A., Modi, A.: Cogmen: Contextualized gnn based multimodal emotion recognition. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4148–4164 (2022)

[41] Huang, Y., Lin, J., Zhou, C., Yang, H., Huang, L.: Modality competition: What makes joint training of multi-modal network fail in deep learning? (provably). In: International Conference on Machine Learning, pp. 9226–9259 (2022). PMLR

[42] Fan, Y., Xu, W., Wang, H., Liu, J., Guo, S.: Detached and interactive multimodal learning. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5470–5478 (2024)

[43] Wei, Y., Li, S., Feng, R., Hu, D.: Diagnosing and re-learning for balanced multimodal learning. In: European Conference on Computer Vision, pp. 71–86 (2025). Springer

[44] Guo, Z., Jin, T., Zhao, Z.: Multimodal prompt learning with missing modalities for sentiment analysis and emotion recognition. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1726–1736 (2024)

[45] Nguyen, C.-V.T., Le, T.-S., Mai, A.-T., Le, D.-T.: Ada2i: Enhancing modality balance for multimodal conversational emotion recognition. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9330–9339 (2024)

[46] Planamente, M., Plizzari, C., Alberti, E., Caputo, B.: Domain generalization through audio-visual relative norm alignment in first person action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1807–1818 (2022)

[47] Hua, C., Xu, Q., Bao, S., Yang, Z., Huang, Q.: Reconboost: Boosting can achieve modality reconcilement. In: The Forty-first International Conference on Machine Learning (2024)

[48] Du, C., Li, T., Liu, Y., Wen, Z., Hua, T., Wang, Y., Zhao, H.: Improving multi-modal learning with uni-modal teachers. arXiv preprint arXiv:2106.11059 (2021)

[49] Wei, Y., Hu, D., Du, H., Wen, J.-R.: On-the-fly modulation for balanced multimodal learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

[50] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48 (2009)

[51] Soviany, P., Ionescu, R.T., Rota, P., Sebe, N.: Curriculum learning: A survey. International Journal of Computer Vision 130(6), 1526–1565 (2022)

[52] Wang, X., Zhou, Y., Chen, H., Zhu, W.: Curriculum learning for multimedia in the era of large language models. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11296–11297 (2024)

[53] Tong, A., Tang, C., Wang, W.: Semi-supervised action recognition from temporal augmentation using curriculum learning. IEEE Transactions on Circuits and Systems for Video Technology 33(3), 1305–1319 (2022)

[54] Yu, T., Wang, J., Luo, J., Wang, J., Zhou, G.: Tacl: A trusted action-enhanced curriculum learning approach to multimodal affective computing. Neurocomputing 620, 129195 (2025)

[55] Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M.E., Stone, P.: Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research 21(181), 1–50 (2020)

[56] Qian, C., Han, K., Wang, J., Yuan, Z., Lyu, C., Chen, J., Liu, Z.: Dyncim: Dynamic curriculum for imbalanced multimodal learning. arXiv preprint arXiv:2503.06456 (2025)

[57] Tullis, J.G., Benjamin, A.S.: On the effectiveness of self-paced learning. Journal of Memory and Language 64(2), 109–118 (2011)

[58] Han, K., Lyu, C., Ma, L., Qian, C., Ma, S., Pang, Z., Chen, J., Liu, Z.: Climd: A curriculum learning framework for imbalanced multimodal diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 65–74 (2025). Springer

[59] Zhou, Y., Liang, X., Xu, Y., Gao, B.: Sample-level self-paced learning to tackle multimodal imbalance problem. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE

[60] Wang, X., Chen, Y., Zhu, W.: A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 4555–4576 (2021)

[61] Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. In: International Conference on Machine Learning, pp. 2535–2544 (2019). PMLR

[62] Zhang, D., Zhang, W., Li, S., Zhu, Q., Zhou, G.: Modeling both intra- and inter-modal influence for real-time emotion detection in conversations. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 503–511 (2020)

[63] Hu, D., Hou, X., Wei, L., Jiang, L., Mo, Y.: Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7037–7041 (2022). IEEE

[64] Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462 (2010)

[65] Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: Openface 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66 (2018). IEEE

[66] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computati... https://doi.org/10.18653/v1/d19-1410

[67] Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.: Competence-based curriculum learning for neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1162–1172 (2019)

[68] Park, C.Y., Cha, N., Kang, S., Kim, A., Khandoker, A.H., Hadjileontiadis, L., Oh, A., Jeong, Y., Lee, U.: K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Scientific Data 7(1), 293 (2020)