
arxiv: 2409.07388 · v3 · submitted 2024-09-11 · 💻 cs.CL

Recent Advances in Multimodal Affective Computing: An NLP Perspective

Pith reviewed 2026-05-23 21:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal affective computing · sentiment analysis · emotion recognition · NLP · multitask learning · pre-trained models · knowledge enhancement · contextual modeling

The pith

A survey establishes a unified view of multimodal affective computing by comparing four NLP tasks and organizing methods into four modeling paradigms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically reviews recent advances in multimodal affective computing from an NLP perspective. It centers on four representative tasks—multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition—and compares their task formulations, benchmark datasets, and evaluation protocols. Representative methods are organized into the paradigms of multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. A sympathetic reader would care because the resulting structure makes patterns across a scattered literature visible and identifies shared challenges in interpreting human emotions and intentions. The review also extends to related modalities and emotion cause analysis while releasing a repository of works and resources.

Core claim

By examining the four tasks of multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER), and classifying representative methods into the paradigms of multitask learning, pre-trained models, knowledge enhancement, and contextual modeling, the survey establishes a unified view of the field that facilitates comparison across task formulations, datasets, and evaluation protocols while highlighting key challenges and future directions.

What carries the argument

The four representative tasks together with the four modeling paradigms that serve as the organizing structure for cross-task comparison and synthesis of the literature.

If this is right

  • Patterns in how methods handle multimodal inputs become comparable across tasks that previously appeared separate.
  • Gaps in benchmark datasets and evaluation protocols are identified for potential standardization.
  • The framework extends naturally to facial, acoustic, and physiological modalities as well as emotion cause analysis.
  • A curated repository of works and resources is supplied to support ongoing research.
  • Common challenges such as modality fusion and contextual understanding are positioned as priorities for future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-paradigm lens could be applied to test whether emerging methods require an additional category.
  • NLP-derived insights on text-centric fusion might transfer to tasks that start from visual or audio data alone.
  • Quantifying performance differences across the four paradigms on shared datasets could reveal which approach scales best.
  • The organization may surface connections between affective computing and adjacent areas such as dialogue systems or human-AI interaction.

Load-bearing premise

The four chosen tasks and four listed modeling paradigms form a representative and non-arbitrary partition of the current literature.

What would settle it

Discovery of a substantial body of recent multimodal affective computing papers that fit none of the four tasks and none of the four paradigms.
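
One way to make this test concrete is to tag a sample of recent papers with the tasks and paradigms they address, then count the ones that match nothing in either set. The following is a minimal sketch under stated assumptions: the paper records, their ids, and their tags are hypothetical and are not drawn from the survey or its repository.

```python
# The four tasks and four paradigms named in the survey.
TASKS = {"MSA", "MERC", "MABSA", "MMER"}
PARADIGMS = {
    "multitask learning",
    "pre-trained models",
    "knowledge enhancement",
    "contextual modeling",
}

# Hypothetical annotated sample, for illustration only.
papers = [
    {"id": "p1", "tasks": {"MSA"}, "paradigms": {"pre-trained models"}},
    {"id": "p2", "tasks": {"sarcasm detection"}, "paradigms": set()},
    {"id": "p3", "tasks": {"MERC"}, "paradigms": {"contextual modeling"}},
    {"id": "p4", "tasks": set(), "paradigms": {"multitask learning"}},
]


def uncovered(records):
    """Return ids of papers matching none of the tasks and none of the paradigms."""
    return [
        r["id"]
        for r in records
        if not (r["tasks"] & TASKS) and not (r["paradigms"] & PARADIGMS)
    ]


missing = uncovered(papers)
share = len(missing) / len(papers)
print(missing, share)
```

What counts as a "substantial" share is a judgment the source does not fix, so any threshold applied to `share` would be a further assumption.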

Figures

Figures reproduced from arXiv: 2409.07388 by Chang Sun, Erik Cambria, Guimin Hu, Hasti Seifi, Lin Gui, Ruichu Cai, Weimin Lyu, Zhihong Zhu.

Figure 1: Taxonomy of multimodal affective computing from multimodal fusion and multimodal alignment.
Figure 2: Illustration of multimodal fusion from following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference
Figure 3: Illustration multimodal alignment: (a) semantic alignment and (b)
Figure 4: Taxonomy of multimodal affective computing works from aspects multitask learning, pre-trained model, enhanced knowledge and contextual information.
Figure 5: Illustration of multitask learning in multimodal affective computing
Figure 6: An illustration of pre-trained model in multimodal affective computing
Figure 7: Illustration of enhanced knowledge in multimodal affective computing
Figure 8: Illustration of context information in multimodal affective computing

Original abstract

Multimodal affective computing has gained increasing attention due to its broad applications in understanding human behavior and intentions, particularly in text-centric multimodal scenarios. Existing research spans diverse tasks, modalities, and modeling paradigms, yet lacks a unified perspective. In this survey, we systematically review recent advances from an NLP perspective, focusing on four representative tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). We present a unified view by comparing task formulations, benchmark datasets, and evaluation protocols, and by organizing representative methods into key paradigms, including multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. We further extend the discussion to related directions, such as facial, acoustic, and physiological modalities, as well as emotion cause analysis. Finally, we highlight key challenges and outline promising future directions. To facilitate further research, we release a curated repository of relevant works and resources (https://anonymous.4open.science/r/Multimodal-Affective-Computing-Survey-9819).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper is a survey reviewing recent advances in multimodal affective computing from an NLP perspective. It focuses on four tasks—multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER)—compares their formulations, benchmark datasets, and evaluation protocols, organizes representative methods into four paradigms (multitask learning, pre-trained models, knowledge enhancement, contextual modeling), extends discussion to related modalities and emotion cause analysis, highlights challenges and future directions, and releases a curated repository of resources.

Significance. If the four tasks and paradigms are shown to be representative, the survey would provide a useful organizational framework for the literature, enabling comparisons across task formulations, datasets, and methods while the released repository strengthens reproducibility and follow-on work. The organizational contribution is the primary value, as no new empirical results or derivations are claimed.

major comments (1)
  1. [Abstract and §1] Abstract and §1 (Introduction): the claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal; without these the unification remains an ad-hoc organizing framework whose completeness cannot be assessed.
minor comments (2)
  1. [Abstract] The footnote URL for the repository is given as anonymous; a permanent link or DOI should be provided in the camera-ready version.
  2. Table or section enumerating the four paradigms would benefit from an explicit overlap matrix or decision tree showing how methods are assigned when they span multiple paradigms.
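
The overlap matrix requested in minor comment 2 could be built by counting, for each pair of paradigms, how many methods are assigned to both. The sketch below is illustrative only: the method names and their paradigm assignments are hypothetical placeholders, not classifications taken from the manuscript.

```python
from itertools import combinations

PARADIGMS = [
    "multitask learning",
    "pre-trained models",
    "knowledge enhancement",
    "contextual modeling",
]

# Hypothetical assignments, for illustration only.
assignments = {
    "method_a": {"multitask learning", "pre-trained models"},
    "method_b": {"pre-trained models"},
    "method_c": {"knowledge enhancement", "contextual modeling"},
    "method_d": {"multitask learning", "pre-trained models", "contextual modeling"},
}


def overlap_matrix(assigned, paradigms):
    """Symmetric count matrix: the diagonal holds methods per paradigm,
    off-diagonal cells hold methods tagged with both paradigms."""
    index = {p: i for i, p in enumerate(paradigms)}
    matrix = [[0] * len(paradigms) for _ in paradigms]
    for tags in assigned.values():
        for p in tags:
            matrix[index[p]][index[p]] += 1
        for p, q in combinations(sorted(tags), 2):
            matrix[index[p]][index[q]] += 1
            matrix[index[q]][index[p]] += 1
    return matrix


matrix = overlap_matrix(assignments, PARADIGMS)
for name, row in zip(PARADIGMS, matrix):
    print(f"{name:<22}", row)
```

Large off-diagonal counts relative to the diagonal would indicate paradigms that rarely occur alone, which is the situation where an explicit assignment rule matters most.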

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The primary concern raised is addressed in the point-by-point response below. We believe these revisions will improve the clarity and rigor of the survey's organizational framework.

Point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): the claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal; without these the unification remains an ad-hoc organizing framework whose completeness cannot be assessed.

    Authors: We acknowledge the validity of this observation. The manuscript presents the four tasks as representative based on their prominence in the NLP multimodal affective computing literature, but does not provide explicit justification or a search protocol. In the revised version, we will add a paragraph in Section 1 explaining the selection criteria, including approximate publication counts for each task drawn from major venues (e.g., ACL, EMNLP, CVPR), a discussion of paradigm overlaps, and a note on why tasks such as multimodal sarcasm detection are considered extensions of MSA rather than separate core tasks. This will better support the claim of a unified view without asserting completeness. We do not intend to perform a full PRISMA-style systematic review, as the survey's goal is to organize key paradigms rather than exhaustively catalog all work. revision: partial

Circularity Check

0 steps flagged

No circularity: survey paper with no derivations or predictions

full rationale

This is a literature survey that reviews existing work on four tasks (MSA, MERC, MABSA, MMER) and organizes methods into paradigms. It contains no equations, fitted parameters, predictions, or derivation chains that could reduce to inputs by construction. The claim of presenting a 'unified view' is an organizing framework for comparison of task formulations, datasets, and methods; it does not invoke self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that substitute for independent evidence. No patterns from the enumerated list apply. The paper is self-contained as a review and carries no circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Survey paper containing no new models, derivations, or quantitative claims; therefore the ledger contains no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5744 in / 1210 out tokens · 22301 ms · 2026-05-23T21:17:13.571794+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Hormone-inspired Emotion Layer for Transformer language models (HELT)

    cs.NE 2026-04 unverdicted novelty 7.0

    HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

  2. Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning

    cs.AI 2025-11 unverdicted novelty 5.0

    PRC-Emo integrates prompt engineering, demonstration retrieval, and curriculum learning during LoRA fine-tuning to boost LLMs' emotion recognition in conversations, reaching new state-of-the-art results on IEMOCAP and MELD.

  3. Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

    cs.HC 2025-10 unverdicted novelty 2.0

    A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizin...

Reference graph

Works this paper leans on

298 extracted references · 298 canonical work pages · cited by 3 Pith papers · 11 internal anchors

  1. [1]

    Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,

    Z. Zhu, X. Zhuang, Y . Zhang, D. Xu, G. Hu, X. Wu, and Y . Zheng, “Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 , K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 2024, ...

  2. [2]

    Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,

    Z. Zhu, X. Cheng, G. Hu, Y . Li, Z. Huang, and Y . Zou, “Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy , N. Calzolari, M. Kan, V . Hoste...

  3. [3]

    Ben-Ze’ev, The subtlety of emotions

    A. Ben-Ze’ev, The subtlety of emotions . MIT press, 2001

  4. [4]

    Emotions, sentiments, and performance expectations,

    R. K. Shelly, “Emotions, sentiments, and performance expectations,” in Theory and research on human emotions . Emerald Group Publishing Limited, 2004

  5. [5]

    R. J. Davidson, K. R. Sherer, and H. H. Goldsmith, Handbook of affective sciences. Oxford University Press, 2009

  6. [6]

    Modeling latent discriminative dynamic of multi-dimensional affective signals,

    G. A. Ram ´ırez, T. Baltrusaitis, and L. Morency, “Modeling latent discriminative dynamic of multi-dimensional affective signals,” in Affective Computing and Intelligent Interaction - Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011, Proceedings, Part II, 2011, pp. 396–406

  7. [7]

    A multitask learning framework for multimodal sentiment analysis,

    D. Jiang, R. Wei, H. Liu, J. Wen, G. Tu, L. Zheng, and E. Cambria, “A multitask learning framework for multimodal sentiment analysis,” in 2021 International conference on data mining workshops (ICDMW). IEEE, 2021, pp. 151–157

  8. [8]

    Align before fuse: Vision and language representation learning with momentum distillation,

    J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Sys- tems 2021, NeurIPS 2021, December 6-14, 2021, virtual , M. Ranzato, A. Beygelz...

  9. [9]

    Modality distillation with multiple stream networks for action recognition,

    N. C. Garcia, P. Morerio, and V . Murino, “Modality distillation with multiple stream networks for action recognition,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII , ser. Lecture Notes in Computer Science, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds., vol. 11212. ...

  10. [10]

    Diversified multiple instance learning for document-level multi-aspect sentiment classification,

    Y . Ji, H. Liu, B. He, X. Xiao, H. Wu, and Y . Yu, “Diversified multiple instance learning for document-level multi-aspect sentiment classification,” in Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , 2020, pp. 7012– 7023

  11. [11]

    Identifying sentiment from crowd audio,

    P. J. Donnelly and A. Prestwich, “Identifying sentiment from crowd audio,” in 7th International Conference on Frontiers of Signal Pro- cessing, ICFSP 2022, Paris, France, September 7-9, 2022 , 2022, pp. 64–69. 18

  12. [12]

    Learning relation- ships between text, audio, and video via deep canonical correlation for multimodal language analysis,

    Z. Sun, P. K. Sarma, W. A. Sethares, and Y . Liang, “Learning relation- ships between text, audio, and video via deep canonical correlation for multimodal language analysis,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI...

  13. [13]

    Factorized multimodal transformer for multimodal se- quential learning,

    A. Zadeh, C. Mao, K. Shi, Y . Zhang, P. P. Liang, S. Poria, and L. Morency, “Factorized multimodal transformer for multimodal se- quential learning,” CoRR, vol. abs/1911.09826, 2019

  14. [14]

    Visual attention model for name tagging in multimodal social media,

    D. Lu, L. Neves, V . Carvalho, N. Zhang, and H. Ji, “Visual attention model for name tagging in multimodal social media,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 1990–1999

  15. [15]

    Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,

    A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023

  16. [16]

    Multi-interactive memory network for aspect based multimodal sentiment analysis,

    N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EA...

  17. [17]

    Multimodal emotion-cause pair extraction in conversations,

    F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021

  18. [18]

    Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,

    X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y . Zhang, P. Hong, and S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 11 575–11 589

  19. [19]

    Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,

    S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran, and W. Hua, “Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p. 110219, 2023

  20. [20]

    QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,

    Z. Li, Y . Zhou, Y . Liu, F. Zhu, C. Yang, and S. Hu, “QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 12 191–12 204

  21. [21]

    Multimodal deep learning,

    J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y . Ng, “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 , L. Getoor and T. Scheffer, Eds. Omnipress, 2011, pp. 689–696

  22. [22]

    Integrating multimodal information in large pre- trained transformers,

    W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, and M. E. Hoque, “Integrating multimodal information in large pre- trained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 2359–2369

  23. [23]

    A survey on multi-task learning,

    Y . Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. Knowl. Data Eng. , vol. 34, no. 12, pp. 5586–5609, 2022

  24. [24]

    Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,

    Y . Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,” in Findings of the Asso- ciation for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, ...

  25. [25]

    A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,

    W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 15 445–15 459

  26. [26]

    Unidu: Towards A unified generative dialogue understanding framework,

    Z. Chen, L. Chen, B. Chen, L. Qin, Y . Liu, S. Zhu, J. Lou, and K. Yu, “Unidu: Towards A unified generative dialogue understanding framework,” CoRR, vol. abs/2204.04637, 2022

  27. [27]

    Univilm: A unified video and language pre-training model for mul- timodal understanding and generation,

    H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, “Univilm: A unified video and language pre-training model for mul- timodal understanding and generation,” CoRR, vol. abs/2002.06353, 2020

  28. [28]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748– 8763

  29. [29]

    Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,

    H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,” in NeurIPS, 2022

  30. [30]

    Coca: Contrastive captioners are image-text foundation mod- els,

    J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “Coca: Contrastive captioners are image-text foundation mod- els,” Trans. Mach. Learn. Res. , vol. 2022, 2022

  31. [31]

    V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,

    H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y . Cui, and B. Gong, “V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 24 206–24 221

  32. [32]

    Parameter-efficient transfer learning for NLP,

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 2790–2799

  33. [33]

    Prefix-tuning: Optimizing continuous prompts for generation,

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 , C. Zong, F. Xia, W. Li, and ...

  34. [34]

    Finetuned Language Models Are Zero-Shot Learners

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652 , 2021

  35. [35]

    Language models are few- shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari- wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A...

  36. [36]

    A Survey on In-context Learning

    Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2022

  37. [37]

    Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation,

    S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation,” CoRR, vol. abs/2310.04456, 2023

  38. [38]

    Unimse: Towards unified multimodal sentiment analysis and emotion recognition,

    G. Hu, T. Lin, Y . Zhao, G. Lu, Y . Wu, and Y . Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022 , 2022, pp. 7837–7851

  39. [39]

    Affective computing in the era of large language models: A survey from the nlp perspective,

    Y . Zhang, X. Yang, X. Xu, Z. Gao, Y . Huang, S. Mu, S. Feng, D. Wang, Y . Zhang, K. Song et al. , “Affective computing in the era of large language models: A survey from the nlp perspective,” arXiv preprint arXiv:2408.04638, 2024

  40. [40]

    A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods,

    B. Pan, K. Hirota, Z. Jia, and Y . Dai, “A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods,” Neurocomputing, vol. 561, p. 126866, 2023

  41. [41]

    Emotion recognition from unimodal to multimodal analysis: A review,

    K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023

  42. [42]

    A review of chinese sentiment analysis: Subjects, methods, and trends

    Z. W ANG, X. ZHANG, J. CUI, S.-B. HO, and E. CAMBRIA, “A review of chinese sentiment analysis: Subjects, methods, and trends.”

  43. [43]

    Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,

    A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Information Fusion, vol. 91, pp. 424–444, 2023

  44. [44]

    Multimodal sentiment analysis based on fusion methods: A survey,

    L. Zhu, Z. Zhu, C. Zhang, Y . Xu, and X. Kong, “Multimodal sentiment analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp. 306–325, 2023

  45. [45]

    Sentiment classification using doc- ument embeddings trained with cosine similarity,

    T. Thongtan and T. Phienthrakul, “Sentiment classification using doc- ument embeddings trained with cosine similarity,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds. Associatio...

  46. [46]

    Towards multimodal sentiment analysis: harvesting opinions from the web,

    L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment analysis: harvesting opinions from the web,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI 2011, Alicante, Spain, November 14-18, 2011 , 2011, pp. 169–176

  47. [47]

    Survey on multimodal approaches to emotion recognition,

    A. G. A. and V . Vetriselvi, “Survey on multimodal approaches to emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023

  48. [48]

    A discourse-aware graph neural network for emotion recognition in multi-party conversation,

    Y . Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network for emotion recognition in multi-party conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistic...

  49. [49]

    COGMEN: contextualized GNN based multimodal emotion recognition,

    A. Joshi, A. Bhat, A. Jain, A. V . Singh, and A. Modi, “COGMEN: contextualized GNN based multimodal emotion recognition,” CoRR, vol. abs/2205.02455, 2022

  50. [50]

    Multi-interactive memory network for aspect based multimodal sentiment analysis,

    N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, no. 01, 2019, pp. 371–378

  51. [51]

    Transfer capsule network for aspect level sentiment classification,

    Z. Chen and T. Qian, “Transfer capsule network for aspect level sentiment classification,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , A. Korhonen, D. R. Traum, and L. M `arquez, Eds. Association for Computational Linguistics, 2019, pp. 547–556

  52. [52]

    A unified generative framework for aspect-based sentiment analysis,

    H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative framework for aspect-based sentiment analysis,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 ,...

  53. [53]

    Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,

    C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y . Gu, Z. Shao, Q. Zheng, N. Zhang, Y . Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, vol. abs/2109.08306, 2021

  54. [54]

    SGM: sequence generation model for multi-label classification,

    P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence generation model for multi-label classification,” in Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. 3915–3926

  55. [55]

    Label-specific dual graph neural network for multi-label text classification,

    Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph neural network for multi-label text classification,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021,...

  56. [56]

    Distributed representations of words and phrases and their compositionality,

    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 3111–3119

  58. [58]

    Glove: Global vectors for word representation,

    J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1532–1543

  59. [59]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, ...

  60. [60]

    BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,

    M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp....

  61. [61]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

  62. [62]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023

  63. [63]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. K...

  64. [64]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023

  65. [65]

    AST: audio spectrogram transformer,

    Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram transformer,” in 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 571–575

  66. [66]

    Efficientnet: Rethinking model scaling for convolutional neural networks,

    M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6105–6114

  67. [67]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenRev...

  68. [68]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, 2021, pp. 8748–8763

  69. [69]

    Gpt-4v(ision) system card,

    “Gpt-4v(ision) system card,” 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263218031

  70. [70]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  71. [71]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022

  72. [72]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023

  73. [73]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

  74. [74]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023

  75. [75]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024

  76. [76]

    Scaling instruction-finetuned language models,

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

  77. [77]

    Tensor fusion network for multimodal sentiment analysis,

    A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for Computational Linguistics, 2017, pp. 1103–1114

  78. [78]

    Multimodal transformer for unaligned multimodal language sequences,

    Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrque...

  79. [79]

    Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis,

    C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1476–1488, 2023

  80. [80]

    CM-BERT: cross-modal BERT for text-audio sentiment analysis,

    K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-audio sentiment analysis,” in MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, 2020, pp. 521–528

Showing first 80 references.