Recent Advances in Multimodal Affective Computing: An NLP Perspective

Chang Sun; Erik Cambria; Guimin Hu; Hasti Seifi; Lin Gui; Ruichu Cai; Weimin Lyu; Zhihong Zhu

arXiv: 2409.07388 · v3 · submitted 2024-09-11 · 💻 cs.CL

Guimin Hu, Weimin Lyu, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai, Erik Cambria, Hasti Seifi

Pith reviewed 2026-05-23 21:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords: multimodal affective computing, sentiment analysis, emotion recognition, NLP, multitask learning, pre-trained models, knowledge enhancement, contextual modeling

The pith

A survey establishes a unified view of multimodal affective computing by comparing four NLP tasks and organizing methods into four modeling paradigms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically reviews recent advances in multimodal affective computing from an NLP perspective. It centers on four representative tasks—multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition—and compares their task formulations, benchmark datasets, and evaluation protocols. Representative methods are organized into the paradigms of multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. A sympathetic reader would care because the resulting structure makes patterns across a scattered literature visible and identifies shared challenges in interpreting human emotions and intentions. The review also extends to related modalities and emotion cause analysis while releasing a repository of works and resources.

Core claim

By examining the four tasks of multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER), and classifying representative methods into the paradigms of multitask learning, pre-trained models, knowledge enhancement, and contextual modeling, the survey establishes a unified view of the field that facilitates comparison across task formulations, datasets, and evaluation protocols while highlighting key challenges and future directions.

What carries the argument

The four representative tasks together with the four modeling paradigms that serve as the organizing structure for cross-task comparison and synthesis of the literature.

If this is right

Patterns in how methods handle multimodal inputs become comparable across tasks that previously appeared separate.
Gaps in benchmark datasets and evaluation protocols are identified for potential standardization.
The framework extends naturally to facial, acoustic, and physiological modalities as well as emotion cause analysis.
A curated repository of works and resources is supplied to support ongoing research.
Common challenges such as modality fusion and contextual understanding are positioned as priorities for future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same four-paradigm lens could be applied to test whether emerging methods require an additional category.
NLP-derived insights on text-centric fusion might transfer to tasks that start from visual or audio data alone.
Quantifying performance differences across the four paradigms on shared datasets could reveal which approach scales best.
The organization may surface connections between affective computing and adjacent areas such as dialogue systems or human-AI interaction.

Load-bearing premise

The four chosen tasks and four listed modeling paradigms form a representative and non-arbitrary partition of the current literature.

What would settle it

Discovery of a substantial body of recent multimodal affective computing papers that fit none of the four tasks and none of the four paradigms.

Figures

Figures reproduced from arXiv: 2409.07388 by Chang Sun, Erik Cambria, Guimin Hu, Hasti Seifi, Lin Gui, Ruichu Cai, Weimin Lyu, Zhihong Zhu.

**Figure 2.** Illustration of multimodal fusion from the following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Illustration of multimodal alignment: (a) semantic alignment and (b) ... [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Taxonomy of multimodal affective computing works from the aspects of multitask learning, pre-trained models, enhanced knowledge, and contextual information. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Illustration of multitask learning in multimodal affective computing [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Illustration of pre-trained models in multimodal affective computing [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Illustration of enhanced knowledge in multimodal affective computing [PITH_FULL_IMAGE:figures/full_fig_p011_7.png]

**Figure 8.** Illustration of context information in multimodal affective computing [PITH_FULL_IMAGE:figures/full_fig_p012_8.png]

Original abstract

Multimodal affective computing has gained increasing attention due to its broad applications in understanding human behavior and intentions, particularly in text-centric multimodal scenarios. Existing research spans diverse tasks, modalities, and modeling paradigms, yet lacks a unified perspective. In this survey, we systematically review recent advances from an NLP perspective, focusing on four representative tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). We present a unified view by comparing task formulations, benchmark datasets, and evaluation protocols, and by organizing representative methods into key paradigms, including multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. We further extend the discussion to related directions, such as facial, acoustic, and physiological modalities, as well as emotion cause analysis. Finally, we highlight key challenges and outline promising future directions. To facilitate further research, we release a curated repository of relevant works and resources (https://anonymous.4open.science/r/Multimodal-Affective-Computing-Survey-9819).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey organizes four multimodal affective tasks and four method paradigms from an NLP angle and ships a repo, but the representativeness claim lacks selection criteria or coverage data.

The main takeaway is that this is a standard literature survey that groups existing work on MSA, MERC, MABSA, and MMER, compares their formulations, datasets, and protocols, and sorts methods into multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. It also releases a public repository of papers and resources. That structure and the repo are the concrete contributions; nothing new is derived or measured here.

Referee Report

1 major / 2 minor

Summary. The paper is a survey reviewing recent advances in multimodal affective computing from an NLP perspective. It focuses on four tasks—multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER)—compares their formulations, benchmark datasets, and evaluation protocols, organizes representative methods into four paradigms (multitask learning, pre-trained models, knowledge enhancement, contextual modeling), extends discussion to related modalities and emotion cause analysis, highlights challenges and future directions, and releases a curated repository of resources.

Significance. If the four tasks and paradigms are shown to be representative, the survey would provide a useful organizational framework for the literature, enabling comparisons across task formulations, datasets, and methods while the released repository strengthens reproducibility and follow-on work. The organizational contribution is the primary value, as no new empirical results or derivations are claimed.

major comments (1)

[Abstract and §1 (Introduction)] The claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal. Without these, the unification remains an ad hoc organizing framework whose completeness cannot be assessed.

minor comments (2)

[Abstract] The footnote URL for the repository is given as anonymous; a permanent link or DOI should be provided in the camera-ready version.
Table or section enumerating the four paradigms would benefit from an explicit overlap matrix or decision tree showing how methods are assigned when they span multiple paradigms.
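
The overlap matrix requested in the second minor comment could be produced along the following lines. This is a minimal sketch, not something the manuscript provides: the method names (`method_a` to `method_d`) and their paradigm tags are invented purely for illustration, and only the four paradigm names come from the survey.

```python
from itertools import combinations_with_replacement

# The four paradigms named in the survey.
PARADIGMS = [
    "multitask learning",
    "pre-trained models",
    "knowledge enhancement",
    "contextual modeling",
]

# Hypothetical assignments, invented for illustration only. They are not
# taken from the survey or from any real method.
ASSIGNMENTS = {
    "method_a": {"multitask learning", "pre-trained models"},
    "method_b": {"pre-trained models"},
    "method_c": {"knowledge enhancement", "contextual modeling"},
    "method_d": {"multitask learning", "pre-trained models", "contextual modeling"},
}


def overlap_matrix(assignments, paradigms):
    """Count, for every pair of paradigms, the methods tagged with both.

    The diagonal holds the number of methods tagged with each paradigm.
    """
    index = {name: i for i, name in enumerate(paradigms)}
    matrix = [[0] * len(paradigms) for _ in paradigms]
    for tags in assignments.values():
        unknown = tags.difference(index)
        if unknown:
            raise ValueError(f"unknown paradigm(s): {sorted(unknown)}")
        ordered = sorted(tags, key=index.get)
        for a, b in combinations_with_replacement(ordered, 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            if i != j:
                matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix


MATRIX = overlap_matrix(ASSIGNMENTS, PARADIGMS)

if __name__ == "__main__":
    for name, row in zip(PARADIGMS, MATRIX):
        print(f"{name:<22}", row)
```

Large off-diagonal counts relative to the diagonal would indicate that the paradigms overlap heavily, which is the situation in which an explicit assignment rule or decision tree becomes necessary.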

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The primary concern raised is addressed in the point-by-point response below. We believe these revisions will improve the clarity and rigor of the survey's organizational framework.

Referee: [Abstract and §1 (Introduction)] The claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal. Without these, the unification remains an ad hoc organizing framework whose completeness cannot be assessed.

Authors: We acknowledge the validity of this observation. The manuscript presents the four tasks as representative based on their prominence in the NLP multimodal affective computing literature, but does not provide explicit justification or a search protocol. In the revised version, we will add a paragraph in Section 1 explaining the selection criteria, including approximate publication counts for each task drawn from major venues (e.g., ACL, EMNLP, CVPR), a discussion of paradigm overlaps, and a note on why tasks such as multimodal sarcasm detection are considered extensions of MSA rather than separate core tasks. This will better support the claim of a unified view without asserting completeness. We do not intend to perform a full PRISMA-style systematic review, as the survey's goal is to organize key paradigms rather than exhaustively catalog all work.

revision: partial

Circularity Check

0 steps flagged

No circularity: survey paper with no derivations or predictions

full rationale

This is a literature survey that reviews existing work on four tasks (MSA, MERC, MABSA, MMER) and organizes methods into paradigms. It contains no equations, fitted parameters, predictions, or derivation chains that could reduce to inputs by construction. The claim of presenting a 'unified view' is an organizing framework for comparison of task formulations, datasets, and methods; it does not invoke self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that substitute for independent evidence. No patterns from the enumerated list apply. The paper is self-contained as a review and carries no circularity burden.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Survey paper containing no new models, derivations, or quantitative claims; therefore the ledger contains no free parameters, axioms, or invented entities.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Hormone-inspired Emotion Layer for Transformer language models (HELT)
cs.NE 2026-04 unverdicted novelty 7.0

HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning
cs.AI 2025-11 unverdicted novelty 5.0

PRC-Emo integrates prompt engineering, demonstration retrieval, and curriculum learning during LoRA fine-tuning to boost LLMs' emotion recognition in conversations, reaching new state-of-the-art results on IEMOCAP and MELD.

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects
cs.HC 2025-10 unverdicted novelty 2.0

A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizin...

Reference graph

Works this paper leans on

298 extracted references · 298 canonical work pages · cited by 3 Pith papers · 11 internal anchors

[1]

Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,

Z. Zhu, X. Zhuang, Y . Zhang, D. Xu, G. Hu, X. Wu, and Y . Zheng, “Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 , K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 2024, ...

work page 2024
[2]

Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,

Z. Zhu, X. Cheng, G. Hu, Y . Li, Z. Huang, and Y . Zou, “Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy , N. Calzolari, M. Kan, V . Hoste...

work page 2024
[3]

Ben-Ze’ev, The subtlety of emotions

A. Ben-Ze’ev, The subtlety of emotions . MIT press, 2001

work page 2001
[4]

Emotions, sentiments, and performance expectations,

R. K. Shelly, “Emotions, sentiments, and performance expectations,” in Theory and research on human emotions . Emerald Group Publishing Limited, 2004

work page 2004
[5]

R. J. Davidson, K. R. Sherer, and H. H. Goldsmith, Handbook of affective sciences. Oxford University Press, 2009

work page 2009
[6]

Modeling latent discriminative dynamic of multi-dimensional affective signals,

G. A. Ram ´ırez, T. Baltrusaitis, and L. Morency, “Modeling latent discriminative dynamic of multi-dimensional affective signals,” in Affective Computing and Intelligent Interaction - Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011, Proceedings, Part II, 2011, pp. 396–406

work page 2011
[7]

A multitask learning framework for multimodal sentiment analysis,

D. Jiang, R. Wei, H. Liu, J. Wen, G. Tu, L. Zheng, and E. Cambria, “A multitask learning framework for multimodal sentiment analysis,” in 2021 International conference on data mining workshops (ICDMW). IEEE, 2021, pp. 151–157

work page 2021
[8]

Align before fuse: Vision and language representation learning with momentum distillation,

J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Sys- tems 2021, NeurIPS 2021, December 6-14, 2021, virtual , M. Ranzato, A. Beygelz...

work page 2021
[9]

Modality distillation with multiple stream networks for action recognition,

N. C. Garcia, P. Morerio, and V . Murino, “Modality distillation with multiple stream networks for action recognition,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII , ser. Lecture Notes in Computer Science, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds., vol. 11212. ...

work page 2018
[10]

Diversified multiple instance learning for document-level multi-aspect sentiment classification,

Y . Ji, H. Liu, B. He, X. Xiao, H. Wu, and Y . Yu, “Diversified multiple instance learning for document-level multi-aspect sentiment classification,” in Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , 2020, pp. 7012– 7023

work page 2020
[11]

Identifying sentiment from crowd audio,

P. J. Donnelly and A. Prestwich, “Identifying sentiment from crowd audio,” in 7th International Conference on Frontiers of Signal Pro- cessing, ICFSP 2022, Paris, France, September 7-9, 2022 , 2022, pp. 64–69. 18

work page 2022
[12]

Learning relation- ships between text, audio, and video via deep canonical correlation for multimodal language analysis,

Z. Sun, P. K. Sarma, W. A. Sethares, and Y . Liang, “Learning relation- ships between text, audio, and video via deep canonical correlation for multimodal language analysis,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI...

work page 2020
[13]

Factorized multimodal transformer for multimodal se- quential learning,

A. Zadeh, C. Mao, K. Shi, Y . Zhang, P. P. Liang, S. Poria, and L. Morency, “Factorized multimodal transformer for multimodal se- quential learning,” CoRR, vol. abs/1911.09826, 2019

work page arXiv 1911
[14]

Visual attention model for name tagging in multimodal social media,

D. Lu, L. Neves, V . Carvalho, N. Zhang, and H. Ji, “Visual attention model for name tagging in multimodal social media,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 1990–1999

work page 2018
[15]

Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,

A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023

work page 2023
[16]

Multi-interactive memory network for aspect based multimodal sentiment analysis,

N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EA...

work page 2019
[17]

Multimodal emotion-cause pair extraction in conversations,

F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021

work page arXiv 2021
[18]

Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,

X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y . Zhang, P. Hong, and S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 11 575–11 589

work page 2023
[19]

Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,

S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran, and W. Hua, “Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p. 110219, 2023

work page 2023
[20]

QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,

Z. Li, Y . Zhou, Y . Liu, F. Zhu, C. Yang, and S. Hu, “QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 12 191–12 204

work page 2023
[21]

Multimodal deep learning,

J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y . Ng, “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 , L. Getoor and T. Scheffer, Eds. Omnipress, 2011, pp. 689–696

work page 2011
[22]

Integrating multimodal information in large pre- trained transformers,

W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, and M. E. Hoque, “Integrating multimodal information in large pre- trained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 2359–2369

work page 2020
[23]

A survey on multi-task learning,

Y . Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. Knowl. Data Eng. , vol. 34, no. 12, pp. 5586–5609, 2022

work page 2022
[24]

Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,

Y . Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,” in Findings of the Asso- ciation for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, ...

work page 2021
[25]

A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,

W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 15 445–15 459

work page 2023
[26]

Unidu: Towards A unified generative dialogue understanding framework,

Z. Chen, L. Chen, B. Chen, L. Qin, Y . Liu, S. Zhu, J. Lou, and K. Yu, “Unidu: Towards A unified generative dialogue understanding framework,” CoRR, vol. abs/2204.04637, 2022

work page arXiv 2022
[27]

Univilm: A unified video and language pre-training model for mul- timodal understanding and generation,

H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, “Univilm: A unified video and language pre-training model for mul- timodal understanding and generation,” CoRR, vol. abs/2002.06353, 2020

work page arXiv 2002
[28]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748– 8763

work page 2021
[29]

Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,

H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,” in NeurIPS, 2022

work page 2022
[30]

Coca: Contrastive captioners are image-text foundation mod- els,

J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “Coca: Contrastive captioners are image-text foundation mod- els,” Trans. Mach. Learn. Res. , vol. 2022, 2022

work page 2022
[31]

V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,

H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y . Cui, and B. Gong, “V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 24 206–24 221

work page 2021
[32]

Parameter-efficient transfer learning for NLP,

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 2790–2799

work page 2019
[33]

Prefix-tuning: Optimizing continuous prompts for generation,

X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 , C. Zong, F. Xia, W. Li, and ...

work page 2021
[34]

Finetuned Language Models Are Zero-Shot Learners

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

Language models are few- shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari- wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A...

work page 2020
[36]

A Survey on In-context Learning

Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation,

S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation,” CoRR, vol. abs/2310.04456, 2023

work page arXiv 2023
[38]

Unimse: Towards unified multimodal sentiment analysis and emotion recognition,

G. Hu, T. Lin, Y . Zhao, G. Lu, Y . Wu, and Y . Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022 , 2022, pp. 7837–7851

work page 2022
[39]

Affective computing in the era of large language models: A survey from the nlp perspective,

Y . Zhang, X. Yang, X. Xu, Z. Gao, Y . Huang, S. Mu, S. Feng, D. Wang, Y . Zhang, K. Song et al. , “Affective computing in the era of large language models: A survey from the nlp perspective,” arXiv preprint arXiv:2408.04638, 2024

work page arXiv 2024
[40]

A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods,

B. Pan, K. Hirota, Z. Jia, and Y . Dai, “A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods,” Neurocomputing, vol. 561, p. 126866, 2023

work page 2023
[41]

Emotion recognition from unimodal to multimodal analysis: A review,

K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023

work page 2023
[42]

A review of chinese sentiment analysis: Subjects, methods, and trends

Z. W ANG, X. ZHANG, J. CUI, S.-B. HO, and E. CAMBRIA, “A review of chinese sentiment analysis: Subjects, methods, and trends.”

work page
[43]

Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,

A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Information Fusion, vol. 91, pp. 424–444, 2023

work page 2023
[44]

Multimodal sentiment analysis based on fusion methods: A survey,

L. Zhu, Z. Zhu, C. Zhang, Y . Xu, and X. Kong, “Multimodal sentiment analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp. 306–325, 2023

work page 2023
[45]

Sentiment classification using doc- ument embeddings trained with cosine similarity,

T. Thongtan and T. Phienthrakul, “Sentiment classification using doc- ument embeddings trained with cosine similarity,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds. Associatio...

work page 2019
[46]

Towards multimodal sentiment analysis: harvesting opinions from the web,

L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment analysis: harvesting opinions from the web,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI 2011, Alicante, Spain, November 14-18, 2011 , 2011, pp. 169–176

work page 2011
[47]

Survey on multimodal approaches to emotion recognition,

A. G. A. and V . Vetriselvi, “Survey on multimodal approaches to emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023

work page 2023
[48]

A discourse-aware graph neural network for emotion recognition in multi-party conversation,

Y . Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network for emotion recognition in multi-party conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistic...

work page 2021
[49]

COGMEN: contextualized GNN based multimodal emotion recognition,

A. Joshi, A. Bhat, A. Jain, A. V . Singh, and A. Modi, “COGMEN: contextualized GNN based multimodal emotion recognition,” CoRR, vol. abs/2205.02455, 2022

work page arXiv 2022
[50]

Multi-interactive memory network for aspect based multimodal sentiment analysis,

N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, no. 01, 2019, pp. 371–378

work page 2019
[51]

Transfer capsule network for aspect level sentiment classification,

Z. Chen and T. Qian, “Transfer capsule network for aspect level sentiment classification,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , A. Korhonen, D. R. Traum, and L. M `arquez, Eds. Association for Computational Linguistics, 2019, pp. 547–556

[52] H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative framework for aspect-based sentiment analysis,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021,...

[53] C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y. Gu, Z. Shao, Q. Zheng, N. Zhang, Y. Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, vol. abs/2109.08306, 2021

[54] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence generation model for multi-label classification,” in Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. 3915–3926

[55] Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph neural network for multi-label text classification,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021,...

[56] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 3111–3119

[58] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1532–1543

[59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, ...

[60] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp....

[61] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

[62] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023

[63] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. K...

[64] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023

[65] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram transformer,” in 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 571–575

[66] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6105–6114

[67] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenRev...

[68] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, 2021, pp. 8748–8763

[69] “Gpt-4v(ision) system card,” 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263218031

[70] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

[71] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022

[72] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023

[73] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

[74] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023

[75] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024

[76] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

[77] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for Computational Linguistics, 2017, pp. 1103–1114

[78] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrque...

[79] C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1476–1488, 2023

[80] K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-audio sentiment analysis,” in MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, 2020, pp. 521–528


[46] [46]

Towards multimodal sentiment analysis: harvesting opinions from the web,

L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment analysis: harvesting opinions from the web,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI 2011, Alicante, Spain, November 14-18, 2011 , 2011, pp. 169–176

work page 2011

[47] [47]

Survey on multimodal approaches to emotion recognition,

A. G. A. and V . Vetriselvi, “Survey on multimodal approaches to emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023

work page 2023

[48] [48]

A discourse-aware graph neural network for emotion recognition in multi-party conversation,

Y . Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network for emotion recognition in multi-party conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistic...

work page 2021

[49] [49]

COGMEN: contextualized GNN based multimodal emotion recognition,

A. Joshi, A. Bhat, A. Jain, A. V . Singh, and A. Modi, “COGMEN: contextualized GNN based multimodal emotion recognition,” CoRR, vol. abs/2205.02455, 2022

work page arXiv 2022

[50] [50]

Multi-interactive memory network for aspect based multimodal sentiment analysis,

N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, no. 01, 2019, pp. 371–378

work page 2019

[51] [51]

Transfer capsule network for aspect level sentiment classification,

Z. Chen and T. Qian, “Transfer capsule network for aspect level sentiment classification,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , A. Korhonen, D. R. Traum, and L. M `arquez, Eds. Association for Computational Linguistics, 2019, pp. 547–556

work page 2019

[52] [52]

A unified generative framework for aspect-based sentiment analysis,

H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative framework for aspect-based sentiment analysis,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 ,...

work page 2021

[53] [53]

Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,

C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y . Gu, Z. Shao, Q. Zheng, N. Zhang, Y . Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, vol. abs/2109.08306, 2021

work page arXiv 2021

[54] [54]

SGM: sequence generation model for multi-label classification,

P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence generation model for multi-label classification,” in Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018 , 2018, pp. 3915–3926

work page 2018

[55] [55]

Label-specific dual graph neural network for multi-label text classification,

Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph neural network for multi-label text classification,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 ,...

work page 2021

[56] [56]

Distributed representations of words and phrases and their compo- sitionality,

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compo- sitionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems

work page

[57] [57]

3111–3119

Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States , 2013, pp. 3111–3119

work page 2013

[58] [58]

Glove: Global vectors for word representation,

J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL , 2014, pp. 1532–1543

work page 2014

[59] [59]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, ...

work page 2017

[60] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp....

[61] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

[62] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023

[63] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. K...

[64] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023

[65] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram transformer,” in 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 571–575

[66] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6105–6114

[67] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenRev...

[68] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, 2021, pp. 8748–8763

[69] “Gpt-4v(ision) system card,” 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263218031

[70] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

[71] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022

[72] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023

[73] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023

[74] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023

[75] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024

[76] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

[77] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for Computational Linguistics, 2017, pp. 1103–1114

[78] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrque...

[79] C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1476–1488, 2023

[80] K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-audio sentiment analysis,” in MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, 2020, pp. 521–528