Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis
Pith reviewed 2026-05-24 20:59 UTC · model grok-4.3
The pith
Fusing text, audio, and video with one-step deep canonical correlation analysis produces embeddings that improve sentiment classification beyond text-only results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the one-Step DCCA outperforms the current state-of-the-art in learning multi-modal embeddings, as shown through higher classification accuracy on sentiment tasks when embeddings are learned from text, audio, and video views together.
What carries the argument
Deep Canonical Correlation Analysis (DCCA) in one-step form, which directly maximizes correlations across the three data views to form joint embeddings.
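To make "maximizes correlations" concrete: DCCA trains one network per view so that the outputs of the two networks are as correlated as possible. The sketch below, in plain Python, computes the Pearson correlation between two one-dimensional projections, which is the quantity a single canonical pair maximizes. It is an illustration only: the function name and the toy values are hypothetical, and neither the networks nor the multi-dimensional objective used in the paper are reproduced here.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long 1-D projections."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant projection")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical projected values for four samples from two views.
text_projection = [0.1, 0.4, 0.5, 0.9]
audio_video_projection = [0.2, 0.8, 1.0, 1.8]
corr = pearson_correlation(text_projection, audio_video_projection)
```

A correlation-maximizing method would adjust the projections of both views to push this value toward 1; here the second projection is simply a scaled copy of the first, so the two are already perfectly correlated.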
If this is right
- The one-step version of DCCA produces better embeddings than the two-step version.
- Multi-modal embeddings from all three views exceed the performance of BERT text embeddings alone.
- Each of the three views contributes to the quality of the final embedding.
- The performance gains hold on both standard benchmark datasets and the new Debate Emotion dataset.
Where Pith is reading between the lines
- This fusion technique might be tested on other tasks that combine multiple data modes, such as video emotion recognition.
- Measuring exact contribution ratios from each view could support more efficient selection of which modes to include.
- Applying the same one-step approach to additional data modes like images could check whether the gains generalize.
Load-bearing premise
The three views contain complementary information that DCCA can exploit to produce embeddings whose downstream classification accuracy exceeds what the BERT text view alone achieves, and that this improvement can be measured without post-hoc view selection.
What would settle it
If sentiment classification experiments show that one-step DCCA embeddings do not achieve higher accuracy than BERT text embeddings alone on the same datasets, the central claim would be falsified.
Original abstract
This paper learns multi-modal embeddings from text, audio, and video views/modes of data in order to improve upon down-stream sentiment classification. The experimental framework also allows investigation of the relative contributions of the individual views in the final multi-modal embedding. Individual features derived from the three views are combined into a multi-modal embedding using Deep Canonical Correlation Analysis (DCCA) in two ways i) One-Step DCCA and ii) Two-Step DCCA. This paper learns text embeddings using BERT, the current state-of-the-art in text encoders. We posit that this highly optimized algorithm dominates over the contribution of other views, though each view does contribute to the final result. Classification tasks are carried out on two benchmark datasets and on a new Debate Emotion data set, and together these demonstrate that the one-Step DCCA outperforms the current state-of-the-art in learning multi-modal embeddings.
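As a rough illustration of the One-Step variant: the Methods excerpt quoted in the reference graph says that DCCA accepts two views at a time and that One-Step DCCA concatenates the audio and video features before applying DCCA. The sketch below shows only that input preparation, using plain Python lists. The function names and toy features are hypothetical, pairing the combined view with the text view is an assumption (the excerpt is truncated at that point), and the DCCA step itself is not shown.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def concat_views(audio: Sequence[Sequence[float]],
                 video: Sequence[Sequence[float]]) -> List[Vector]:
    """Concatenate each sample's audio and video features into one vector."""
    if len(audio) != len(video):
        raise ValueError("audio and video need the same number of samples")
    return [list(a) + list(v) for a, v in zip(audio, video)]


def one_step_views(text: Sequence[Sequence[float]],
                   audio: Sequence[Sequence[float]],
                   video: Sequence[Sequence[float]]
                   ) -> Tuple[List[Vector], List[Vector]]:
    """Build two inputs for a two-view DCCA.

    Assumption: the combined audio-video view is paired with the text view.
    """
    combined = concat_views(audio, video)
    if len(text) != len(combined):
        raise ValueError("text needs the same number of samples as audio/video")
    return [list(t) for t in text], combined


# Toy features (made up): 2 samples, 3-dim text, 2-dim audio, 1-dim video.
text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
audio = [[1.0, 2.0], [3.0, 4.0]]
video = [[9.0], [8.0]]
view_1, view_2 = one_step_views(text, audio, video)
```

The two returned views would then be handed to whatever two-view DCCA implementation is in use.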
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes learning multi-modal embeddings for sentiment analysis by fusing text features (extracted via BERT), audio, and video using Deep Canonical Correlation Analysis (DCCA) in one-step and two-step variants. It evaluates the approach on two benchmark datasets plus a new Debate Emotion dataset and claims that one-step DCCA outperforms the current state-of-the-art in multi-modal embedding learning while also allowing analysis of individual view contributions.
Significance. If the empirical claims are supported by proper controls and ablations, the work would provide evidence that DCCA can still yield gains from audio/video even when a dominant text encoder like BERT is used, and the new dataset would be a useful community resource.
major comments (2)
- [Abstract] The central claim that one-Step DCCA outperforms the current state-of-the-art is asserted without any reported accuracies, baselines, statistical tests, or ablation details. The full experimental section is required to assess whether this claim is supported.
- [Experimental evaluation] The claim that one-Step DCCA outperforms prior multi-modal SOTA holds only if the reported classification accuracies exceed those of a BERT text-only baseline on the same tasks; the abstract notes that BERT dominates the other views, so absence of this control leaves the assumption that DCCA extracts complementary signal from audio/video untested.
minor comments (1)
- [Abstract] The distinction between one-Step DCCA and two-Step DCCA is mentioned but not summarized; a brief summary would help readers understand the experimental framework.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim that one-Step DCCA outperforms the current state-of-the-art is asserted without any reported accuracies, baselines, statistical tests, or ablation details. The full experimental section is required to assess whether this claim is supported.
  Authors: The full manuscript contains the experimental results, including accuracies on the benchmark datasets and the new Debate Emotion dataset, comparisons against prior multi-modal baselines, and ablation studies on view contributions (see Sections 4 and 5). We agree the abstract would benefit from greater specificity. In the revision we will update the abstract to report the key accuracy improvements and note that results are statistically significant relative to the strongest baselines.
  Revision: yes
- Referee: [Experimental evaluation] The claim that one-Step DCCA outperforms prior multi-modal SOTA holds only if the reported classification accuracies exceed those of a BERT text-only baseline on the same tasks; the abstract notes that BERT dominates the other views, so absence of this control leaves the assumption that DCCA extracts complementary signal from audio/video untested.
  Authors: We acknowledge this is a substantive concern. Although the manuscript already compares against prior multi-modal embedding methods, the absence of an explicit BERT text-only control does leave the incremental value of the audio/video views via DCCA less directly tested. We will add the BERT text-only baseline to all three datasets in the revised experimental section, together with the corresponding ablation on view contributions, so that the complementary signal can be quantified.
  Revision: yes
Circularity Check
No circularity; purely empirical claims on external benchmarks
Full rationale
The paper reports an empirical comparison of one-step and two-step DCCA for fusing BERT text embeddings with audio and video features on three sentiment datasets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are measured against external SOTA baselines and are therefore falsifiable outside the paper's own inputs. The reader's assessment of score 2.0 is consistent with the absence of any load-bearing internal reduction.
Reference graph
Works this paper leans on
- [1] Introduction. Various social media platforms make available a variety of multi-modal content generated through expression of opinions and ideologies by social media users in the form of written commentary, podcasts, and lifestyle vlogs on a variety of topics such as politics, entertainment, reviews of movies, products etc. Multi-modal data enables one...
- [2] Related Work. The idea of combining multi-modal text, audio and video features expressed in this paper is closest to that of [1] which encodes text, speech, and visual signals using a BiLSTM encoder, openSMILE, and 3D-CNN respectively. Encoded outputs are then concatenated and passed through a classifier. In contrast, this paper employs multi...
- [3] Methods. This section briefly reviews Deep Canonical Correlation Analysis (DCCA) and outlines the methods used to obtain unimodal features. This section also outlines the procedure used to obtain the multi-modal embeddings used for experiments in Section 4. 3.1. Deep Canonical Correlation Analysis (DCCA). Classic Canonical Correlation Analysis (CCA) [14]...
- [4] Henceforth, audio embeddings are denoted by $v_a$. Video Encoding: Framewise features from the video stream are extracted using a combination of FACET (https://imotions.com/facial-expressions/) and OpenFace 2.0 [16]. For each 10-second video, video-level feature vectors are obtained by averaging across the feature vectors corresponding to individu...
- [5] Henceforth, video embeddings are denoted by $v_v$. 3.3. Methodology. DCCA accepts two views of data at a time and learns a correlated subspace. Since we are working with three views of data, all three views must be combined. We consider two different procedures. The One-Step DCCA concatenates the audio and video features and applies DCCA to this combined a...
- [6] ← DCCA($v'_1$, $v'_2$). 5: $\hat{v}'_1 = [\bar{v}'_1 \mid v'_1]$, $\hat{v}'_2 = [\bar{v}'_2 \mid v'_2]$. 6: Return $v_{multi} = [\hat{v}'_1 \mid \hat{v}'_2]$. 3.3.3. Sentiment Classification. Multi-view embeddings obtained from One-Step DCCA and Two-Step DCCA are input to a logistic regression classifier to predict the sentiment label for test data sets in Section 4.
- [7] Experiments. This section first describes the different test data sets used and the baseline methods that are evaluated against embeddings obtained from One-Step DCCA and Two-Step DCCA. Multi-modal embeddings obtained from DCCA methods are input to a logistic regression classifier and accuracy and F-scores on test data sets are reported as the performance...
- [8] Frames scored (0, 3] are labeled as ‘positive’ and scores between [−3, 0) are labeled as ‘negative’. There are a total of 17859 data points available for binary sentiment classification. Data is partitioned into predetermined train (12787 points), validation (3634 points) and test (1438 points) splits. Raw features for CMU-MOSI and CMU-MOSEI are obtaine...
- [9] Discussions and Conclusions. The key issue is this: which is more important, text, speech, or video? Being able to process all three views of human discourse simultaneously allows consideration of the basic question of the relative contributions of the semantics, the spoken delivery, and the accompanying images. Conventional wisdom would be that the text...
- [10] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh, and A. Hussain, “Multimodal sentiment analysis: Addressing key issues and setting up the baselines,” IEEE Intelligent Systems, vol. 33, no. 6, pp. 17–25, 2018.
- [11] A. Hu and S. Flaxman, “Multimodal sentiment analysis to explore the structure of emotions,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 350–358.
- [12] R. Arora and K. Livescu, “Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7135–7139.
- [13] E. P. Bucy, J. M. Foley, J. Lukito, L. Doroshenko, D. V. Shah, J. Pevehouse, and C. Wells, “Performing populism: Trump's transgressive debate style and the dynamics of Twitter response,” New Media & Society, 2018.
- [14] J. Joo, E. P. Bucy, and C. Seidel, “Automated coding of televised leader displays: Detecting nonverbal political behavior with computer vision,” in International Journal of Communication, In press.
- [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [16] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos,” arXiv preprint arXiv:1606.06259, 2016.
- [17] A. Zadeh, P. P. Liang, J. Vanbriesen, S. Poria, E. Cambria, M. Chen, and L.-P. Morency, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Association for Computational Linguistics (ACL), 2018.
- [18] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, pp. 1247–1255.
- [19] S. H. Dumpala, I. Sheikh, R. Chakraborty, and S. K. Kopparapu, “Audio-visual fusion for sentiment classification using cross-modal autoencoder.”
- [20] P. Rastogi, B. Van Durme, and R. Arora, “Multiview LSA: Representation learning via generalized CCA,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 556–566.
- [21] P. K. Sarma, Y. Liang, and B. Sethares, “Domain adapted word embeddings for improved sentiment classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2018, pp. 37–42.
- [22] A. Benton, R. Arora, and M. Dredze, “Learning multiview embeddings of Twitter users,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 14–19.
- [23] H. Hotelling, “Relations between two sets of variates,” in Breakthroughs in Statistics. Springer, 1992, pp. 162–190.
- [24] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, “COVAREP: A collaborative voice analysis repository for speech technologies,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 960–964.
- [25] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “OpenFace 2.0: Facial behavior analysis toolkit,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 59–66.
- [26] T. Baltrušaitis, M. Mahmoud, and P. Robinson, “Cross-dataset learning and person-specific normalisation for automatic action unit detection,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 6. IEEE, 2015, pp. 1–6.
- [27] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, “Multi-attention recurrent network for human communication comprehension,” arXiv preprint arXiv:1802.00923, 2018.
- [28] P. Horst, “Generalized canonical correlations and their applications to experimental data,” in Journal of Clinical Psychology, 17(4), 1961.
- [29] M. E. Grabe and E. P. Bucy, Image Bite Politics: News and the Visual Framing of Elections. Oxford University Press, 2009.
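The labeling rule and split sizes quoted in entry [8] of the reference graph can be checked with a few lines of Python. This is a minimal sketch: the function and constant names are hypothetical, and the excerpt does not say how a score of exactly 0 is handled, so returning None for it is an assumption.

```python
from typing import Optional


def sentiment_label(score: float) -> Optional[str]:
    """Map a score in [-3, 3] to a binary sentiment label.

    (0, 3] -> 'positive', [-3, 0) -> 'negative'.
    A score of exactly 0 lies in neither interval; None is an assumption.
    """
    if not -3.0 <= score <= 3.0:
        raise ValueError("score must lie in [-3, 3]")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None


# Split sizes as quoted in the excerpt.
SPLITS = {"train": 12787, "validation": 3634, "test": 1438}
TOTAL_POINTS = sum(SPLITS.values())
```

The three split sizes sum to the 17859 data points reported in the same excerpt, so the quoted figures are internally consistent.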