
arxiv: 2605.21565 · v1 · pith:PEQHUR5Gnew · submitted 2026-05-20 · 💻 cs.LG

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Pith reviewed 2026-05-22 09:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-paced curriculum learning · multimodal emotion recognition · modality imbalance · conversational emotion recognition · difficulty measurer · IEMOCAP · MELD · plug-and-play framework

The pith

Self-paced curriculum learning with dual-level difficulty scoring reduces modality imbalance in conversational emotion recognition, with reported weighted F1 gains of roughly 1.2% to 6.6% on IEMOCAP and up to 10.4% on MELD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a plug-and-play self-paced curriculum learning framework to address modality misalignment and imbalanced learning in multimodal emotion recognition for conversations. It introduces a dual-level difficulty measurer that scores fine-grained challenges at the utterance level for each modality and broader structures at the conversation level including emotional dependencies and coherence. A learning scheduler then orders training to move from easier to harder instances based on those scores. This integration into existing architectures aims to prevent any single modality from dominating and to produce more robust models. A reader would care because more balanced multimodal training could yield reliable emotion detection that makes fuller use of language, voice, and facial cues in dialogues.

Core claim

The paper claims that a dual-level difficulty measurer within self-paced curriculum learning, which computes utterance-level modality-specific difficulty scores and conversation-level scores capturing emotional dependencies and modality coherence, when paired with a scheduler that guides training from easy to hard instances, alleviates modality imbalance when plugged into existing multimodal emotion recognition architectures and produces higher weighted F1 scores on IEMOCAP and MELD.
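Because the claim is stated in terms of weighted F1, a minimal sketch of that metric may be useful: per-class F1 averaged with weights equal to each class's share of the true labels. The function name and the standard-library-only implementation are illustrative choices, not taken from the paper.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)          # number of true instances per class
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1  # weight each class by its support
    return score
```

Weighting by support means frequent emotion classes dominate the score, which matters on skewed label distributions.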

What carries the argument

The dual-level Difficulty Measurer that produces utterance-level scores for modality-specific difficulty and conversation-level scores for dialogue structures, together with the Learning Scheduler that orders instances from easier to more difficult according to those scores.
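The two components can be sketched as follows. This page does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the utterance-level score is taken as the mean of hypothetical per-modality losses, the conversation-level score is a precomputed number, the two are blended with an assumed weight `alpha`, and a simple linear easy-to-hard pacing rule stands in for the Learning Scheduler.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    uid: str
    modality_loss: dict  # hypothetical per-modality loss, e.g. {"text": 0.2, "audio": 0.9}
    conv_score: float    # hypothetical conversation-level difficulty of the parent dialogue

def difficulty(u, alpha=0.5):
    """Blend utterance-level and conversation-level difficulty (alpha is an assumed weight)."""
    utt_score = sum(u.modality_loss.values()) / len(u.modality_loss)
    return alpha * utt_score + (1.0 - alpha) * u.conv_score

def curriculum(utterances, epoch, total_epochs, start_frac=0.3):
    """Return the easiest subset for this epoch; the admitted fraction grows linearly to 1."""
    ranked = sorted(utterances, key=difficulty)
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    frac = start_frac + (1.0 - start_frac) * progress
    keep = max(1, round(frac * len(ranked)))
    return ranked[:keep]
```

Calling `curriculum(data, epoch, total_epochs)` at the start of each epoch yields a training pool that begins with the lowest-scoring instances and reaches the full dataset by the final epoch.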

If this is right

  • Existing multimodal emotion recognition architectures gain performance without requiring changes to their core design.
  • All modalities contribute more evenly to predictions rather than one dominating the learned representation.
  • Model robustness increases across varying modality combinations and base architectures.
  • Training dynamics stabilize by sequencing examples according to measured difficulty instead of random order.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-level scoring idea could be tested on other multimodal sequence tasks where one input type tends to overshadow the others.
  • If the difficulty scores align with actual learning progress, they might support online adaptation in live dialogue applications.
  • Combining this scheduler with existing regularization methods could address imbalance in noisier, real-world conversation data.

Load-bearing premise

The dual-level Difficulty Measurer accurately captures utterance-level modality-specific difficulty and conversation-level structures including emotional dependencies and modality coherence in a manner that genuinely improves training dynamics.

What would settle it

Training the enhanced models on IEMOCAP or MELD with the difficulty measurer replaced by random instance ordering and finding that the reported performance gains no longer appear.
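One way to set up such a control is to keep the staged schedule identical and swap only the ranking. The sketch below assumes a staged curriculum with cumulative subsets; `staged_subsets` and `random_control` are hypothetical helper names, and instances are represented by hashable ids.

```python
import random

def staged_subsets(items, key, num_stages):
    """Rank items by `key`, then return cumulative subsets that grow stage by stage."""
    ranked = sorted(items, key=key)
    n = len(ranked)
    return [ranked[: max(1, (stage + 1) * n // num_stages)] for stage in range(num_stages)]

def random_control(items, num_stages, seed=0):
    """Same staged schedule, but the ranking is a seeded random permutation."""
    rng = random.Random(seed)
    rank = {item: rng.random() for item in items}  # random stand-in for difficulty
    return staged_subsets(items, key=rank.__getitem__, num_stages=num_stages)
```

Because both conditions see the same number of instances at every stage, any remaining gap in weighted F1 can be attributed to the difficulty ordering rather than to the pacing itself.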

Original abstract

Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a plug-and-play Self-Paced Curriculum Learning (SPCL) framework for Multimodal Emotion Recognition in Conversations (MERC) to address modality misalignment and imbalance. It introduces a dual-level Difficulty Measurer that computes utterance-level modality-specific difficulty scores and conversation-level scores incorporating emotional dependencies and modality coherence; a Learning Scheduler then orders training instances from easier to harder. The authors integrate this module into existing MERC architectures and report weighted F1 improvements of approximately +1.2% to +6.6% on IEMOCAP and up to +10.4% on MELD across multiple models and modality settings.

Significance. If the reported gains can be shown to arise specifically from the modality-aware components of the Difficulty Measurer rather than generic curriculum ordering, the work would supply a lightweight, architecture-agnostic technique for improving robustness in multimodal conversational tasks. The plug-and-play design and consistent gains across datasets would be practically useful, though the current evidence does not yet isolate the modality-balance mechanism.

major comments (2)
  1. [Abstract and §3] Dual-level Difficulty Measurer: the central claim that SPCL specifically alleviates modality imbalance is not supported by an ablation that disables the utterance-level modality-specific score while retaining the conversation-level score. Without this control, the observed F1 gains remain consistent with any self-paced scheduler and do not demonstrate a unique contribution to modality balance.
  2. [§4] Experiments: no direct imbalance metric (e.g., per-modality accuracy variance or cross-modal loss disparity) is reported before versus after SPCL, and no statistical testing or error bars across multiple runs are provided. These omissions leave open the possibility that gains reflect generic curriculum effects or dataset-specific variance rather than the claimed modality-balance improvement.
minor comments (2)
  1. [Method] The precise formulation of the utterance-level modality-specific difficulty score (how language, visual, and audio difficulties are combined) is described only at a high level, without explicit equations or pseudocode, which hinders reproducibility.
  2. [Abstract] The range of improvements (+1.2% to +6.6% on IEMOCAP) should specify which baseline architectures and modality combinations produce the lower and upper ends of the range.
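The imbalance metrics named in the second major comment are not defined on this page. A minimal sketch under assumed definitions: accuracy variance is the population variance of unimodal accuracies, and loss disparity is the gap between the largest and smallest mean per-modality loss. Both function names are hypothetical.

```python
from statistics import pvariance

def accuracy_variance(acc_by_modality):
    """Population variance of unimodal accuracies; 0 means perfectly even modalities."""
    return pvariance(list(acc_by_modality.values()))

def loss_disparity(loss_by_modality):
    """Gap between the worst and best mean per-modality loss."""
    values = list(loss_by_modality.values())
    return max(values) - min(values)
```

Reporting both quantities for the baseline and for the SPCL-enhanced model would show directly whether the modalities move closer together, independent of the overall F1.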

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The suggestions help clarify the unique contributions of our dual-level Difficulty Measurer. We address each major comment below and will incorporate revisions to strengthen the evidence for modality-balance improvements.

Point-by-point responses
  1. Referee: [Abstract and §3] Dual-level Difficulty Measurer: the central claim that SPCL alleviates modality imbalance specifically is not supported by an ablation that disables the utterance-level modality-specific score while retaining the conversation-level score. Without this control, the observed F1 gains remain consistent with any self-paced scheduler and do not demonstrate a unique contribution to modality balance.

    Authors: We agree that an ablation isolating the utterance-level modality-specific component is necessary to substantiate its role in addressing modality imbalance beyond generic self-paced ordering. In the revised manuscript, we will add this control experiment: we will train variants using only the conversation-level score (with emotional dependencies and modality coherence) and compare against the full dual-level SPCL. Performance differences on IEMOCAP and MELD will be reported to show the incremental benefit of the modality-specific utterance-level scoring. revision: yes

  2. Referee: [§4] Experiments: no direct imbalance metric (e.g., per-modality accuracy variance or cross-modal loss disparity) is reported before versus after SPCL, and no statistical testing or error bars across multiple runs are provided. These omissions leave open the possibility that gains reflect generic curriculum effects or dataset-specific variance rather than the claimed modality-balance improvement.

    Authors: We acknowledge that direct metrics and statistical validation would more convincingly link gains to modality balance rather than generic curriculum effects. In the revised §4, we will report per-modality accuracy variance and cross-modal loss disparity before versus after SPCL application. We will also include error bars from multiple runs (at least 5 random seeds) and apply paired statistical significance tests (e.g., t-test) on the weighted F1 improvements to rule out dataset variance. revision: yes
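The paired test mentioned in the second response can be sketched with the standard library alone. The function below computes the paired t statistic and its degrees of freedom from per-seed weighted F1 scores. `paired_t` is a hypothetical name, and turning the statistic into a p-value needs a t-distribution CDF, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(baseline, treated):
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1
```

Pairing by seed matters here: each seed's baseline and SPCL runs share initialization and data order, so the test measures the per-seed improvement rather than the spread between seeds.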

Circularity Check

0 steps flagged

SPCL framework introduced as additive plug-and-play module with empirical gains; no definitional reduction or self-referential derivation

full rationale

The paper presents SPCL as an external curriculum technique integrated into existing MERC architectures via a newly defined dual-level Difficulty Measurer (utterance-level modality-specific scores plus conversation-level structure scores) and a Learning Scheduler. Claimed F1 improvements (+1.2% to +6.6% on IEMOCAP, up to +10.4% on MELD) are reported from experiments across architectures and datasets rather than derived by construction from fitted parameters or prior self-citations. No equations reduce the modality-balance alleviation to quantities defined within the paper's own inputs; the approach remains an empirical additive intervention whose central claims rest on observed performance deltas, not on self-definition or load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on standard machine learning assumptions plus the unverified effectiveness of the newly introduced dual-level difficulty scoring mechanism; the abstract details no explicit free parameters or invented entities, and the one axiom listed below is implied rather than stated.

axioms (1)
  • domain assumption: Difficulty scores from the dual-level measurer reflect genuine learning challenges that benefit from curriculum ordering.
    This premise underpins the entire learning scheduler and is invoked when claiming alleviation of modality imbalance.

pith-pipeline@v0.9.0 · 5777 in / 1302 out tokens · 42882 ms · 2026-05-22T09:39:55.822584+00:00 · methodology


