Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
Pith reviewed 2026-05-10 04:29 UTC · model grok-4.3
The pith
Multimodal models can learn stable causal representations by disentangling each modality into invariant and spurious parts using causal constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By framing multimodal representation learning as a causal inference problem, CmIR disentangles each modality into a causal invariant part that maintains stable predictive relationships with the label across environments and a spurious part that captures environment-specific noise. The framework enforces this split with three constraints: an invariance constraint to ensure consistent prediction, a mutual information constraint to minimize leakage of spurious information, and a reconstruction constraint to preserve sufficient input information. Experiments show that the resulting invariant representations yield improved generalization on out-of-distribution and noisy multimodal benchmarks.
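The three constraints can be made concrete with a toy sketch. The functional forms below are assumptions for illustration only (a variance-of-risks proxy for invariance, a linear cross-correlation proxy for mutual-information leakage, mean squared error for reconstruction) — they are not the paper's actual losses.

```python
import numpy as np

# Illustrative sketch of CmIR's three constraints on toy data.
# All functional forms are assumed, not taken from the paper.

def invariance_penalty(env_risks):
    # Zero when the predictor's risk is identical in every environment
    # (an IRM-style proxy for cross-environment stability).
    return float(np.var(np.asarray(env_risks, dtype=float)))

def leakage_proxy(z_c, z_s):
    # Zero when the invariant part z_c and spurious part z_s are linearly
    # decorrelated -- a crude stand-in for a mutual information constraint.
    z_c = z_c - z_c.mean(axis=0)
    z_s = z_s - z_s.mean(axis=0)
    cross = (z_c.T @ z_s) / len(z_c)
    return float(np.mean(cross ** 2))

def reconstruction_loss(x, x_hat):
    # Sufficiency: the two parts together should rebuild the input.
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
z_c, z_s = x[:, :4], x[:, 4:]  # pretend split of one modality's embedding
total = (invariance_penalty([0.31, 0.29, 0.30])
         + 0.1 * leakage_proxy(z_c, z_s)
         + 0.1 * reconstruction_loss(x, x))
```

The weights (1.0, 0.1, 0.1) are placeholders; in the actual framework each term would be computed from learned encoders and tuned hyper-parameters.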
What carries the argument
The causal modality-invariant representation (CmIR) framework, which performs theoretically grounded disentanglement of each modality into causal invariant and environment-specific spurious representations enforced by invariance, mutual information, and reconstruction constraints.
If this is right
- Invariant representations retain stable predictive relationships with labels even under distribution shifts.
- The method improves performance on noisy multimodal inputs compared with models that learn spurious correlations.
- Reconstruction and mutual information constraints together prevent loss of useful information during disentanglement.
- The approach delivers state-of-the-art results on multiple affective computing benchmarks while excelling on out-of-distribution test sets.
Where Pith is reading between the lines
- The same disentanglement logic could be tested on vision-language or sensor-fusion tasks where environment shifts are common.
- If the constraints succeed, they might offer a template for making other multimodal systems more robust without task-specific redesign.
- Applying the framework to single-modality data by treating data subsets as different environments could reveal whether the causal split generalizes beyond multiple modalities.
Load-bearing premise
That the three constraints can reliably isolate causal invariant representations from spurious ones in practice without discarding information needed for accurate prediction.
What would settle it
An experiment in which the learned invariant representations fail to maintain stable accuracy across deliberately varied environments or when modalities are corrupted, while standard multimodal baselines do not show this gap.
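The shape of that settling experiment can be sketched in a few lines: fit a probe on clean multimodal features, replace one modality with noise, and measure the accuracy gap. Everything here is a toy stand-in (synthetic features, a least-squares probe), not the paper's benchmark protocol.

```python
import numpy as np

# Toy robustness probe: does corrupting one "modality" (a feature slice)
# degrade a linear probe trained on clean data? The data-generating
# process and probe are assumptions for illustration.

rng = np.random.default_rng(1)
n, d = 500, 6
X = rng.normal(size=(n, d))
w_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # only "modality A" matters
y = (X @ w_true > 0).astype(int)

# Least-squares probe fit on clean features, targets in {-1, +1}.
w = np.linalg.lstsq(X, 2 * y - 1.0, rcond=None)[0]

def accuracy(X_eval):
    return float(np.mean((X_eval @ w > 0) == y))

X_corrupt = X.copy()
X_corrupt[:, 3:] = rng.normal(size=(n, 3))  # replace "modality B" with noise
gap = accuracy(X) - accuracy(X_corrupt)
```

A method with truly invariant representations should show a small `gap` even when the corrupted modality carried spurious signal during training; a baseline that leaned on that signal should not.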
Figures
Original abstract
Multimodal affective computing aims to predict humans' sentiment, emotion, intention, and opinion using language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shifts or noisy modalities. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we introduce a theoretically grounded disentanglement method that separates each modality into 'causal invariant representation' and 'environment-specific spurious representation' from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint. Experiments across multiple multimodal benchmarks demonstrate that CmIR achieves state-of-the-art performance. CmIR particularly excels on out-of-distribution data and noisy data, confirming its robustness and generalizability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CmIR, a causal modality-invariant representation learning framework for robust multimodal affective computing. It claims to disentangle each modality into a 'causal invariant representation' and an 'environment-specific spurious representation' using an invariance constraint, a mutual information constraint, and a reconstruction constraint. The method is said to preserve stable predictive relationships with labels across environments while retaining sufficient input information, leading to state-of-the-art performance on multimodal benchmarks with particular gains on out-of-distribution and noisy data.
Significance. If the disentanglement is shown to be identifiably causal and the robustness gains are reproducible, the work could strengthen the link between causal inference and multimodal representation learning, offering a practical way to mitigate spurious correlations in affective computing tasks where distribution shifts and modality noise are common.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): The central claim that the three constraints produce a 'theoretically grounded' separation into causal invariant vs. spurious representations lacks an identifiability theorem or explicit causal graph. Standard regularizers (adversarial invariance, MI maximization, reconstruction) do not by themselves guarantee recovery of the true causal features under a data-generating process with environment interventions; without a proof or synthetic SCM experiment demonstrating unique recovery, the 'causal' label does not add load-bearing content beyond ordinary domain-invariant learning.
- [§5] §5 (Experiments): The reported SOTA results on OOD and noisy data are presented without baseline details, ablation studies on the individual constraints, or controls for the number of environments. It is therefore impossible to determine whether the gains are attributable to the claimed causal disentanglement or to generic regularization effects.
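The synthetic SCM experiment the report asks for is cheap to set up. The structure below is an assumed example, not the paper's model: the label comes from a stable mechanism on a causal feature, while a spurious feature's alignment with the label flips sign across environments.

```python
import numpy as np

# Minimal synthetic SCM (assumed structure): y is produced by a stable
# mechanism from causal feature x_c; spurious feature x_s correlates
# with y with an environment-dependent sign.

def sample_env(n, spur_strength, rng):
    x_c = rng.normal(size=n)
    y = (x_c + 0.1 * rng.normal(size=n) > 0).astype(int)
    x_s = spur_strength * (2 * y - 1) + rng.normal(size=n)
    return x_c, x_s, y

rng = np.random.default_rng(2)
corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])

xc1, xs1, y1 = sample_env(2000, +1.5, rng)  # env 1: spurious aligned
xc2, xs2, y2 = sample_env(2000, -1.5, rng)  # env 2: alignment flipped

# The causal correlation is stable across environments; the spurious one
# flips sign -- exactly what a recovered invariant representation must ignore.
stable = (corr(xc1, y1) > 0) and (corr(xc2, y2) > 0)
flips = (corr(xs1, y1) > 0) and (corr(xs2, y2) < 0)
```

Running the proposed method on data like this and checking whether the learned invariant part loads on `x_c` but not `x_s` would give the identifiability evidence the report says is missing.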
minor comments (2)
- [§3] Notation for the three constraints should be introduced with explicit loss equations and hyper-parameter schedules rather than descriptive prose only.
- [§2] The paper should include a clear statement of the assumed causal model (e.g., which variables are intervened across environments) to make the causal-inference framing falsifiable.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our work. We address each major comment below.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that the three constraints produce a 'theoretically grounded' separation into causal invariant vs. spurious representations lacks an identifiability theorem or explicit causal graph. Standard regularizers (adversarial invariance, MI maximization, reconstruction) do not by themselves guarantee recovery of the true causal features under a data-generating process with environment interventions; without a proof or synthetic SCM experiment demonstrating unique recovery, the 'causal' label does not add load-bearing content beyond ordinary domain-invariant learning.
Authors: We thank the referee for this observation. The framework is motivated by causal inference principles to separate modality representations into those that maintain stable predictive relationships with the label (invariant/causal) versus those that capture environment-specific spurious correlations. The invariance constraint enforces cross-environment stability, the mutual information constraint ensures retention of label-relevant information, and the reconstruction constraint preserves input fidelity. While these are standard regularizers, their joint application under the multimodal affective computing setting with distribution shifts is intended to target the causal structure. We acknowledge that the manuscript does not include a formal identifiability theorem or explicit SCM experiment. In revision we will update the abstract and §3 to explicitly state the modeling assumptions, clarify that the causal perspective is motivational and interpretive rather than a proven identifiability result, and distinguish the approach from generic domain-invariant learning. revision: partial
Referee: [§5] §5 (Experiments): The reported SOTA results on OOD and noisy data are presented without baseline details, ablation studies on the individual constraints, or controls for the number of environments. It is therefore impossible to determine whether the gains are attributable to the claimed causal disentanglement or to generic regularization effects.
Authors: We appreciate the referee highlighting the need for stronger experimental controls. The original submission reports comparisons to multiple multimodal baselines and includes some ablation results. To directly address the concern, the revised §5 will add (i) detailed per-constraint ablations (removing invariance, MI, or reconstruction one at a time), (ii) explicit controls varying the number of training environments, and (iii) expanded baseline descriptions and implementation details. These additions will allow readers to assess whether performance gains on OOD and noisy data arise from the proposed disentanglement rather than generic regularization. revision: yes
Circularity Check
Invariant representations are defined and enforced by the three optimization constraints
specific steps
- self-definitional [Abstract]
"we introduce a theoretically grounded disentanglement method that separates each modality into 'causal invariant representation' and 'environment-specific spurious representation' from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint."
The separation into 'causal invariant' vs. 'spurious' is not derived from a causal graph or intervention; it is defined as whatever satisfies the three listed constraints during training. The 'causal inference perspective' label is therefore applied to the output of the optimization rather than independently justified.
full rationale
The paper's core claim of a 'theoretically grounded disentanglement' from causal inference reduces to naming the outputs of standard regularizers (invariance, MI, reconstruction) as 'causal invariant' vs. 'spurious'. No independent identifiability theorem, SCM recovery proof, or external verification is shown; the separation is produced exactly by the losses being minimized. This matches self-definitional circularity at the central step. The rest of the framework (benchmarks, OOD gains) is non-circular but inherits the definitional issue.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Causal mechanisms that produce stable predictive relationships exist and can be isolated from environment-specific spurious correlations across modalities.
invented entities (2)
- causal invariant representation (no independent evidence)
- environment-specific spurious representation (no independent evidence)