Recent Advances in Multimodal Affective Computing: An NLP Perspective
Pith reviewed 2026-05-23 21:17 UTC · model grok-4.3
The pith
A survey establishes a unified view of multimodal affective computing by comparing four NLP tasks and organizing methods into four modeling paradigms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey examines four tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). It classifies representative methods into four paradigms: multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. Together, the tasks and paradigms establish a unified view of the field that facilitates comparison across task formulations, datasets, and evaluation protocols while highlighting key challenges and future directions.
What carries the argument
The four representative tasks together with the four modeling paradigms that serve as the organizing structure for cross-task comparison and synthesis of the literature.
If this is right
- Patterns in how methods handle multimodal inputs become comparable across tasks that previously appeared separate.
- Gaps in benchmark datasets and evaluation protocols are identified for potential standardization.
- The framework extends naturally to facial, acoustic, and physiological modalities as well as emotion cause analysis.
- A curated repository of works and resources is supplied to support ongoing research.
- Common challenges such as modality fusion and contextual understanding are positioned as priorities for future work.
Where Pith is reading between the lines
- The same four-paradigm lens could be applied to test whether emerging methods require an additional category.
- NLP-derived insights on text-centric fusion might transfer to tasks that start from visual or audio data alone.
- Quantifying performance differences across the four paradigms on shared datasets could reveal which approach scales best.
- The organization may surface connections between affective computing and adjacent areas such as dialogue systems or human-AI interaction.
Load-bearing premise
The four chosen tasks and four listed modeling paradigms form a representative and non-arbitrary partition of the current literature.
What would settle it
Discovery of a substantial body of recent multimodal affective computing papers that fit none of the four tasks and none of the four paradigms.
Original abstract
Multimodal affective computing has gained increasing attention due to its broad applications in understanding human behavior and intentions, particularly in text-centric multimodal scenarios. Existing research spans diverse tasks, modalities, and modeling paradigms, yet lacks a unified perspective. In this survey, we systematically review recent advances from an NLP perspective, focusing on four representative tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). We present a unified view by comparing task formulations, benchmark datasets, and evaluation protocols, and by organizing representative methods into key paradigms, including multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. We further extend the discussion to related directions, such as facial, acoustic, and physiological modalities, as well as emotion cause analysis. Finally, we highlight key challenges and outline promising future directions. To facilitate further research, we release a curated repository of relevant works and resources (https://anonymous.4open.science/r/Multimodal-Affective-Computing-Survey-9819).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey reviewing recent advances in multimodal affective computing from an NLP perspective. It focuses on four tasks—multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER)—compares their formulations, benchmark datasets, and evaluation protocols, organizes representative methods into four paradigms (multitask learning, pre-trained models, knowledge enhancement, contextual modeling), extends discussion to related modalities and emotion cause analysis, highlights challenges and future directions, and releases a curated repository of resources.
Significance. If the four tasks and paradigms are shown to be representative, the survey would provide a useful organizational framework for the literature, enabling comparisons across task formulations, datasets, and methods while the released repository strengthens reproducibility and follow-on work. The organizational contribution is the primary value, as no new empirical results or derivations are claimed.
major comments (1)
- [Abstract and §1 (Introduction)] The claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal. Without these, the unification remains an ad-hoc organizing framework whose completeness cannot be assessed.
minor comments (2)
- [Abstract] The footnote URL for the repository is given as anonymous; a permanent link or DOI should be provided in the camera-ready version.
- The table or section enumerating the four paradigms would benefit from an explicit overlap matrix or decision tree showing how methods are assigned when they span multiple paradigms.
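One way to picture the overlap matrix requested in the last comment is the minimal sketch below. It is an illustration only: the method names and their paradigm assignments are hypothetical placeholders, not taken from the survey, and the function name `overlap_matrix` is an assumption introduced here. Diagonal cells count the methods assigned to a paradigm, and off-diagonal cells count the methods assigned to both paradigms of the pair.

```python
from itertools import combinations

# The four paradigms named in the survey.
PARADIGMS = [
    "multitask learning",
    "pre-trained models",
    "knowledge enhancement",
    "contextual modeling",
]

# Hypothetical assignments for illustration only; a real matrix would be
# filled from the survey's own method tables.
ASSIGNMENTS = {
    "method_a": {"multitask learning", "pre-trained models"},
    "method_b": {"pre-trained models"},
    "method_c": {"knowledge enhancement", "contextual modeling"},
    "method_d": {"multitask learning", "pre-trained models", "contextual modeling"},
}


def overlap_matrix(assignments, paradigms):
    """Return a symmetric co-assignment count matrix as a list of lists."""
    index = {name: i for i, name in enumerate(paradigms)}
    size = len(paradigms)
    matrix = [[0] * size for _ in range(size)]
    for method, labels in assignments.items():
        unknown = set(labels) - set(index)
        if unknown:
            raise ValueError(f"{method} uses unknown paradigms: {sorted(unknown)}")
        # Diagonal: how many methods fall under each paradigm.
        for label in labels:
            matrix[index[label]][index[label]] += 1
        # Off-diagonal: how many methods span each pair of paradigms.
        for first, second in combinations(sorted(labels), 2):
            matrix[index[first]][index[second]] += 1
            matrix[index[second]][index[first]] += 1
    return matrix


if __name__ == "__main__":
    for name, row in zip(PARADIGMS, overlap_matrix(ASSIGNMENTS, PARADIGMS)):
        print(f"{name:<22} {row}")
```

Large off-diagonal counts would indicate paradigm pairs that are hard to separate, which is where an explicit assignment rule would matter most.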
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The primary concern raised is addressed in the point-by-point response below. We believe these revisions will improve the clarity and rigor of the survey's organizational framework.
- Referee: [Abstract and §1 (Introduction)] The claim that the four tasks 'constitute a representative' partition sufficient for a 'unified view' is load-bearing for the central contribution, yet the manuscript provides no systematic search protocol, publication-count justification, overlap analysis across paradigms, or argument that omitted tasks (e.g., multimodal sarcasm detection) are marginal. Without these, the unification remains an ad-hoc organizing framework whose completeness cannot be assessed.
  Authors: We acknowledge the validity of this observation. The manuscript presents the four tasks as representative based on their prominence in the NLP multimodal affective computing literature, but does not provide explicit justification or a search protocol. In the revised version, we will add a paragraph in Section 1 explaining the selection criteria, including approximate publication counts for each task drawn from major venues (e.g., ACL, EMNLP, CVPR), a discussion of paradigm overlaps, and a note on why tasks such as multimodal sarcasm detection are considered extensions of MSA rather than separate core tasks. This will better support the claim of a unified view without asserting completeness. We do not intend to perform a full PRISMA-style systematic review, as the survey's goal is to organize key paradigms rather than exhaustively catalog all work.
  Revision: partial
Circularity Check
No circularity: survey paper with no derivations or predictions
This is a literature survey that reviews existing work on four tasks (MSA, MERC, MABSA, MMER) and organizes methods into paradigms. It contains no equations, fitted parameters, predictions, or derivation chains that could reduce to inputs by construction. The claim of presenting a 'unified view' is an organizing framework for comparison of task formulations, datasets, and methods; it does not invoke self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that substitute for independent evidence. No patterns from the enumerated list apply. The paper is self-contained as a review and carries no circularity burden.
Forward citations
Cited by 3 Pith papers
- A Hormone-inspired Emotion Layer for Transformer language models (HELT)
  HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
- Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning
  PRC-Emo integrates prompt engineering, demonstration retrieval, and curriculum learning during LoRA fine-tuning to boost LLMs' emotion recognition in conversations, reaching new state-of-the-art results on IEMOCAP and MELD.
- Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects
  A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizin...
Reference graph
Works this paper leans on
-
[1]
Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,
Z. Zhu, X. Zhuang, Y . Zhang, D. Xu, G. Hu, X. Wu, and Y . Zheng, “Tfcd: Towards multi-modal sarcasm detection via training-free coun- terfactual debiasing,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24 , K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 2024, ...
work page 2024
-
[2]
Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,
Z. Zhu, X. Cheng, G. Hu, Y . Li, Z. Huang, and Y . Zou, “Towards multi- modal sarcasm detection via disentangled multi-grained multi-modal distilling,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy , N. Calzolari, M. Kan, V . Hoste...
work page 2024
-
[3]
Ben-Ze’ev, The subtlety of emotions
A. Ben-Ze’ev, The subtlety of emotions . MIT press, 2001
work page 2001
-
[4]
Emotions, sentiments, and performance expectations,
R. K. Shelly, “Emotions, sentiments, and performance expectations,” in Theory and research on human emotions . Emerald Group Publishing Limited, 2004
work page 2004
-
[5]
R. J. Davidson, K. R. Sherer, and H. H. Goldsmith, Handbook of affective sciences. Oxford University Press, 2009
work page 2009
-
[6]
Modeling latent discriminative dynamic of multi-dimensional affective signals,
G. A. Ram ´ırez, T. Baltrusaitis, and L. Morency, “Modeling latent discriminative dynamic of multi-dimensional affective signals,” in Affective Computing and Intelligent Interaction - Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011, Proceedings, Part II, 2011, pp. 396–406
work page 2011
-
[7]
A multitask learning framework for multimodal sentiment analysis,
D. Jiang, R. Wei, H. Liu, J. Wen, G. Tu, L. Zheng, and E. Cambria, “A multitask learning framework for multimodal sentiment analysis,” in 2021 International conference on data mining workshops (ICDMW). IEEE, 2021, pp. 151–157
work page 2021
-
[8]
Align before fuse: Vision and language representation learning with momentum distillation,
J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Sys- tems 2021, NeurIPS 2021, December 6-14, 2021, virtual , M. Ranzato, A. Beygelz...
work page 2021
-
[9]
Modality distillation with multiple stream networks for action recognition,
N. C. Garcia, P. Morerio, and V . Murino, “Modality distillation with multiple stream networks for action recognition,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII , ser. Lecture Notes in Computer Science, V . Ferrari, M. Hebert, C. Sminchisescu, and Y . Weiss, Eds., vol. 11212. ...
work page 2018
-
[10]
Diversified multiple instance learning for document-level multi-aspect sentiment classification,
Y . Ji, H. Liu, B. He, X. Xiao, H. Wu, and Y . Yu, “Diversified multiple instance learning for document-level multi-aspect sentiment classification,” in Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , 2020, pp. 7012– 7023
work page 2020
-
[11]
Identifying sentiment from crowd audio,
P. J. Donnelly and A. Prestwich, “Identifying sentiment from crowd audio,” in 7th International Conference on Frontiers of Signal Pro- cessing, ICFSP 2022, Paris, France, September 7-9, 2022 , 2022, pp. 64–69. 18
work page 2022
-
[12]
Z. Sun, P. K. Sarma, W. A. Sethares, and Y . Liang, “Learning relation- ships between text, audio, and video via deep canonical correlation for multimodal language analysis,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI...
work page 2020
-
[13]
Factorized multimodal transformer for multimodal se- quential learning,
A. Zadeh, C. Mao, K. Shi, Y . Zhang, P. P. Liang, S. Poria, and L. Morency, “Factorized multimodal transformer for multimodal se- quential learning,” CoRR, vol. abs/1911.09826, 2019
-
[14]
Visual attention model for name tagging in multimodal social media,
D. Lu, L. Neves, V . Carvalho, N. Zhang, and H. Ji, “Visual attention model for name tagging in multimodal social media,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 1990–1999
work page 2018
-
[15]
A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023
work page 2023
-
[16]
Multi-interactive memory network for aspect based multimodal sentiment analysis,
N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EA...
work page 2019
-
[17]
Multimodal emotion-cause pair extraction in conversations,
F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021
-
[18]
Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,
X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y . Zhang, P. Hong, and S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 11 575–11 589
work page 2023
-
[19]
Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,
S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran, and W. Hua, “Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p. 110219, 2023
work page 2023
-
[20]
QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,
Z. Li, Y . Zhou, Y . Liu, F. Zhu, C. Yang, and S. Hu, “QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 12 191–12 204
work page 2023
-
[21]
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y . Ng, “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 , L. Getoor and T. Scheffer, Eds. Omnipress, 2011, pp. 689–696
work page 2011
-
[22]
Integrating multimodal information in large pre- trained transformers,
W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, and M. E. Hoque, “Integrating multimodal information in large pre- trained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 2359–2369
work page 2020
-
[23]
A survey on multi-task learning,
Y . Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. Knowl. Data Eng. , vol. 34, no. 12, pp. 5586–5609, 2022
work page 2022
-
[24]
Y . Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,” in Findings of the Asso- ciation for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, ...
work page 2021
-
[25]
W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023 , 2023, pp. 15 445–15 459
work page 2023
-
[26]
Unidu: Towards A unified generative dialogue understanding framework,
Z. Chen, L. Chen, B. Chen, L. Qin, Y . Liu, S. Zhu, J. Lou, and K. Yu, “Unidu: Towards A unified generative dialogue understanding framework,” CoRR, vol. abs/2204.04637, 2022
-
[27]
H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, “Univilm: A unified video and language pre-training model for mul- timodal understanding and generation,” CoRR, vol. abs/2002.06353, 2020
-
[28]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748– 8763
work page 2021
-
[29]
Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,
H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre- training with mixture-of-modality-experts,” in NeurIPS, 2022
work page 2022
-
[30]
Coca: Contrastive captioners are image-text foundation mod- els,
J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “Coca: Contrastive captioners are image-text foundation mod- els,” Trans. Mach. Learn. Res. , vol. 2022, 2022
work page 2022
-
[31]
V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,
H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y . Cui, and B. Gong, “V ATT: transformers for multimodal self-supervised learning from raw video, audio and text,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 24 206–24 221
work page 2021
-
[32]
Parameter-efficient transfer learning for NLP,
N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 2790–2799
work page 2019
-
[33]
Prefix-tuning: Optimizing continuous prompts for generation,
X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 , C. Zong, F. Xia, W. Li, and ...
work page 2021
-
[34]
Finetuned Language Models Are Zero-Shot Learners
J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652 , 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Language models are few- shot learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari- wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A...
work page 2020
-
[36]
A Survey on In-context Learning
Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation,” CoRR, vol. abs/2310.04456, 2023
-
[38]
Unimse: Towards unified multimodal sentiment analysis and emotion recognition,
G. Hu, T. Lin, Y . Zhao, G. Lu, Y . Wu, and Y . Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Nat- ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022 , 2022, pp. 7837–7851
work page 2022
-
[39]
Affective computing in the era of large language models: A survey from the nlp perspective,
Y . Zhang, X. Yang, X. Xu, Z. Gao, Y . Huang, S. Mu, S. Feng, D. Wang, Y . Zhang, K. Song et al. , “Affective computing in the era of large language models: A survey from the nlp perspective,” arXiv preprint arXiv:2408.04638, 2024
-
[40]
B. Pan, K. Hirota, Z. Jia, and Y . Dai, “A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods,” Neurocomputing, vol. 561, p. 126866, 2023
work page 2023
-
[41]
Emotion recognition from unimodal to multimodal analysis: A review,
K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023
work page 2023
-
[42]
A review of chinese sentiment analysis: Subjects, methods, and trends
Z. W ANG, X. ZHANG, J. CUI, S.-B. HO, and E. CAMBRIA, “A review of chinese sentiment analysis: Subjects, methods, and trends.”
-
[43]
A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- timodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future direc- tions,” Information Fusion, vol. 91, pp. 424–444, 2023
work page 2023
-
[44]
Multimodal sentiment analysis based on fusion methods: A survey,
L. Zhu, Z. Zhu, C. Zhang, Y . Xu, and X. Kong, “Multimodal sentiment analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp. 306–325, 2023
work page 2023
-
[45]
Sentiment classification using doc- ument embeddings trained with cosine similarity,
T. Thongtan and T. Phienthrakul, “Sentiment classification using doc- ument embeddings trained with cosine similarity,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds. Associatio...
work page 2019
-
[46]
Towards multimodal sentiment analysis: harvesting opinions from the web,
L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment analysis: harvesting opinions from the web,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI 2011, Alicante, Spain, November 14-18, 2011 , 2011, pp. 169–176
work page 2011
-
[47]
Survey on multimodal approaches to emotion recognition,
A. G. A. and V . Vetriselvi, “Survey on multimodal approaches to emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023
work page 2023
-
[48]
A discourse-aware graph neural network for emotion recognition in multi-party conversation,
Y . Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network for emotion recognition in multi-party conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 , M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistic...
work page 2021
-
[49]
COGMEN: contextualized GNN based multimodal emotion recognition,
A. Joshi, A. Bhat, A. Jain, A. V . Singh, and A. Modi, “COGMEN: contextualized GNN based multimodal emotion recognition,” CoRR, vol. abs/2205.02455, 2022
-
[50]
Multi-interactive memory network for aspect based multimodal sentiment analysis,
N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for aspect based multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, no. 01, 2019, pp. 371–378
work page 2019
-
[51]
Transfer capsule network for aspect level sentiment classification,
Z. Chen and T. Qian, “Transfer capsule network for aspect level sentiment classification,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , A. Korhonen, D. R. Traum, and L. M `arquez, Eds. Association for Computational Linguistics, 2019, pp. 547–556
work page 2019
-
[52]
A unified generative framework for aspect-based sentiment analysis,
H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative framework for aspect-based sentiment analysis,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 ,...
work page 2021
-
[53]
Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,
C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y . Gu, Z. Shao, Q. Zheng, N. Zhang, Y . Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, vol. abs/2109.08306, 2021
-
[54]
SGM: sequence generation model for multi-label classification,
P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence generation model for multi-label classification,” in Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018 , 2018, pp. 3915–3926
work page 2018
-
[55]
Label-specific dual graph neural network for multi-label text classification,
Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph neural network for multi-label text classification,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 ,...
work page 2021
-
[56]
Distributed representations of words and phrases and their compo- sitionality,
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compo- sitionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems
- [57]
-
[58]
Glove: Global vectors for word representation,
J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL , 2014, pp. 1532–1543
work page 2014
-
[59]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, ...
work page 2017
-
[60]
M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to- sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5- 10, 2020, 2020, pp....
work page 2020
-
[61]
Exploring the limits of transfer learning with a unified text-to-text transformer,
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res. , vol. 21, pp. 140:1–140:67, 2020
work page 2020
-
[62]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A. K...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” CoRR, vol. abs/2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
AST: audio spectrogram transformer,
Y . Gong, Y . Chung, and J. R. Glass, “AST: audio spectrogram transformer,” in 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021 , H. Hermansky, H. Cernock ´y, L. Burget, L. Lamel, O. Scharenborg, and P. Motl ´ıcek, Eds. ISCA, 2021, pp. 571–575
work page 2021
-
[66]
Efficientnet: Rethinking model scaling for convolutional neural networks,
M. Tan and Q. V . Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the 36th Interna- tional Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA , ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 6105–6114
work page 2019
-
[67]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenRev...
work page 2021
-
[68]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, 2021, pp. 8748–8763
work page 2021
-
[69]
“Gpt-4v(ision) system card,” 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263218031
work page 2023
-
[70]
Gemini: A Family of Highly Capable Multimodal Models
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Flamingo: a visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022
work page 2022
-
[72]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[73]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[74]
Instructblip: Towards general-purpose vision-language models with instruction tuning,
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023
work page 2023
-
[75]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024
work page 2024
-
[76]
Scaling instruction-finetuned language models,
H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024
work page 2024
-
[77]
Tensor fusion network for multimodal sentiment analysis,
A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for Computational Linguistics, 2017, pp. 1103–1114
work page 2017
-
[78]
Multimodal transformer for unaligned multimodal language sequences,
Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrque...
work page 2019
-
[79]
C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representation augmentation with trimodal collaborative disentanglement network for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1476–1488, 2023
work page 2023
-
[80]
CM-BERT: cross-modal BERT for text-audio sentiment analysis,
K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-audio sentiment analysis,” in MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, 2020, pp. 521–528
work page 2020