Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis

Erik P. Bucy; Prathusha K Sarma; William Sethares; Zhongkai Sun

arXiv: 1907.08696 · v1 · pith:MOBBPFIUnew · submitted 2019-07-15 · 💻 cs.IR · cs.CL · cs.LG · stat.ML

Zhongkai Sun, Prathusha K Sarma, William Sethares, Erik P. Bucy

Pith reviewed 2026-05-24 20:59 UTC · model grok-4.3

classification: 💻 cs.IR · cs.CL · cs.LG · stat.ML

keywords: multi-modal embeddings · deep canonical correlation analysis · sentiment classification · BERT · text audio video · one-step DCCA · multi-view fusion

The pith

Fusing text, audio, and video with one-step deep canonical correlation analysis produces embeddings that improve sentiment classification beyond text-only results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to establish that one-step Deep Canonical Correlation Analysis can combine text from BERT with audio and video views to create multi-modal embeddings that outperform current methods on sentiment classification. It tests both one-step and two-step versions of the method and measures how much each view adds to the result. Experiments on two benchmark datasets plus a new Debate Emotion dataset show higher accuracy when all three views are fused. A sympathetic reader would care if this holds because it offers a direct way to exploit complementary signals across data modes for better downstream performance.

Core claim

The paper claims that the one-Step DCCA outperforms the current state-of-the-art in learning multi-modal embeddings, as shown through higher classification accuracy on sentiment tasks when embeddings are learned from text, audio, and video views together.

What carries the argument

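Deep Canonical Correlation Analysis (DCCA) in one-step form. DCCA learns a correlated subspace from two views at a time, so the one-step variant concatenates the audio and video features into a single view before applying DCCA, and the resulting projections form the joint embedding.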
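To make the mechanism concrete, here is a minimal sketch that swaps in plain linear CCA for the deep networks of DCCA, so it illustrates only the correlation objective and the one-step concatenation, not the authors' model. The function names, the regularisation constant, the synthetic data, and the choice to pair the text view with the concatenated audio-video view are all assumptions made for illustration.

```python
import numpy as np

def linear_cca(x, y, k, reg=1e-4):
    """Linear CCA: return projections (a, b) and the top-k canonical correlations."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    n = x.shape[0]
    # Regularised covariance blocks keep the inverse square roots well defined.
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)

    def inv_sqrt(s):
        w, v = np.linalg.eigh(s)
        return v @ np.diag(w ** -0.5) @ v.T

    sxx_is, syy_is = inv_sqrt(sxx), inv_sqrt(syy)
    # The singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(sxx_is @ sxy @ syy_is)
    return sxx_is @ u[:, :k], syy_is @ vt.T[:, :k], s[:k]

def one_step_fusion(v_t, v_a, v_v, k):
    """Assumed one-step scheme: pair text with the concatenated audio-video view."""
    v_av = np.concatenate([v_a, v_v], axis=1)
    a, b, corr = linear_cca(v_t, v_av, k)
    z_t = (v_t - v_t.mean(axis=0)) @ a
    z_av = (v_av - v_av.mean(axis=0)) @ b
    # Concatenating both projected views is an assumption, not a quoted step.
    return np.concatenate([z_t, z_av], axis=1), corr

# Synthetic views for illustration only; these are not features from the paper.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))
v_t = shared @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
v_a = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
v_v = rng.normal(size=(500, 3))  # a view carrying no shared signal
v_multi, corr = one_step_fusion(v_t, v_a, v_v, k=2)
```

In this toy setup the leading canonical correlation is high because the text and audio views share a latent signal, while the pure-noise video view adds little. That is the kind of per-view contribution question the paper's framework is meant to probe.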

If this is right

The one-step version of DCCA produces better embeddings than the two-step version.
Multi-modal embeddings from all three views exceed the performance of BERT text embeddings alone.
Each of the three views contributes to the quality of the final embedding.
The performance gains hold on both standard benchmark datasets and the new Debate Emotion dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This fusion technique might be tested on other tasks that combine multiple data modes, such as video emotion recognition.
Measuring exact contribution ratios from each view could support more efficient selection of which modes to include.
Applying the same one-step approach to additional data modes like images could check whether the gains generalize.

Load-bearing premise

The three views contain complementary information that DCCA can exploit to produce embeddings whose downstream classification accuracy exceeds what the BERT text view alone achieves, and that this improvement can be measured without post-hoc view selection.

What would settle it

If sentiment classification experiments show that one-step DCCA embeddings do not achieve higher accuracy than BERT text embeddings alone on the same datasets, the central claim would be falsified.
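One way such a check could be run is a paired comparison of the fused and text-only classifiers on the same test items. The sketch below uses an exact McNemar test on the discordant predictions; the choice of test, the function names, and the toy labels are assumptions for illustration and are not taken from the paper.

```python
from math import comb

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcnemar_exact(y_true, pred_fused, pred_text):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    triples = list(zip(y_true, pred_fused, pred_text))
    # Items that exactly one of the two classifiers gets right.
    fused_only = sum(f == t and x != t for t, f, x in triples)
    text_only = sum(f != t and x == t for t, f, x in triples)
    n = fused_only + text_only
    if n == 0:
        return 1.0  # the classifiers never disagree on correctness
    tail = sum(comb(n, i) for i in range(min(fused_only, text_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy labels for illustration only; they are not results from the paper.
y_true = [1] * 10
pred_fused = [1] * 10
pred_text = [1] * 4 + [0] * 6
p_value = mcnemar_exact(y_true, pred_fused, pred_text)
```

A higher fused accuracy together with a small p-value would support the claim. Equal or lower accuracy than the text-only baseline would count against it.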

Original abstract

This paper learns multi-modal embeddings from text, audio, and video views/modes of data in order to improve upon down-stream sentiment classification. The experimental framework also allows investigation of the relative contributions of the individual views in the final multi-modal embedding. Individual features derived from the three views are combined into a multi-modal embedding using Deep Canonical Correlation Analysis (DCCA) in two ways i) One-Step DCCA and ii) Two-Step DCCA. This paper learns text embeddings using BERT, the current state-of-the-art in text encoders. We posit that this highly optimized algorithm dominates over the contribution of other views, though each view does contribute to the final result. Classification tasks are carried out on two benchmark datasets and on a new Debate Emotion data set, and together these demonstrate that the one-Step DCCA outperforms the current state-of-the-art in learning multi-modal embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines BERT with DCCA for multi-modal sentiment but lacks evidence that fusion beats the text view alone.

The main point to know is that the authors combine BERT text features with audio and video using one-step and two-step Deep Canonical Correlation Analysis for sentiment analysis, claiming the one-step version beats existing multi-modal approaches on two benchmarks and a new Debate Emotion dataset.

What the paper does well is to provide a framework that lets them measure how much each view contributes to the final embedding. They correctly identify that the BERT text view is the strongest on its own. Introducing the new dataset is also a positive step for the area.

The soft spots are in the evidence for the main claim. The abstract states the outperformance but supplies no numbers, no details on baselines, and no ablation results. Given that BERT is said to dominate the other views, the multi-modal improvement only makes sense if it exceeds what BERT achieves by itself. The stress-test note correctly flags that without showing gains over a BERT text-only baseline, the benefit of the audio and video fusion is not demonstrated. If the full paper has those comparisons and they hold, then the result is useful. If the experiments skip that control, then the central assumption about complementary information from the three views is not supported.

This work is for specialists in multi-modal sentiment analysis who are interested in DCCA applications. It could be worth discussing in a reading group focused on that topic, but it does not have broad appeal. I would cite it only if the numbers check out against the text baseline. It should go to peer review because the experimental design is reasonable and the new dataset adds something, though revisions would be needed to strengthen the claims. The thinking is clear and engages with the literature on multi-modal fusion.

Referee Report

2 major / 1 minor

Summary. The paper proposes learning multi-modal embeddings for sentiment analysis by fusing text features (extracted via BERT), audio, and video using Deep Canonical Correlation Analysis (DCCA) in one-step and two-step variants. It evaluates the approach on two benchmark datasets plus a new Debate Emotion dataset and claims that one-step DCCA outperforms the current state-of-the-art in multi-modal embedding learning while also allowing analysis of individual view contributions.

Significance. If the empirical claims are supported by proper controls and ablations, the work would provide evidence that DCCA can still yield gains from audio/video even when a dominant text encoder like BERT is used, and the new dataset would be a useful community resource.

major comments (2)

[Abstract] The central claim that one-Step DCCA outperforms the current state-of-the-art is asserted without any reported accuracies, baselines, statistical tests, or ablation details. The full experimental section is required to assess whether this claim is supported.
[Experimental evaluation] The claim that one-Step DCCA outperforms prior multi-modal SOTA holds only if the reported classification accuracies exceed those of a BERT text-only baseline on the same tasks; the abstract notes that BERT dominates the other views, so absence of this control leaves the assumption that DCCA extracts complementary signal from audio/video untested.

minor comments (1)

[Abstract] The distinction between one-Step DCCA and two-Step DCCA is mentioned but not summarized; a brief summary would help readers understand the experimental framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The central claim that one-Step DCCA outperforms the current state-of-the-art is asserted without any reported accuracies, baselines, statistical tests, or ablation details. The full experimental section is required to assess whether this claim is supported.

Authors: The full manuscript contains the experimental results, including accuracies on the benchmark datasets and the new Debate Emotion dataset, comparisons against prior multi-modal baselines, and ablation studies on view contributions (see Sections 4 and 5). We agree the abstract would benefit from greater specificity. In the revision we will update the abstract to report the key accuracy improvements and note that results are statistically significant relative to the strongest baselines.

Revision: yes

Referee: [Experimental evaluation] The claim that one-Step DCCA outperforms prior multi-modal SOTA holds only if the reported classification accuracies exceed those of a BERT text-only baseline on the same tasks; the abstract notes that BERT dominates the other views, so absence of this control leaves the assumption that DCCA extracts complementary signal from audio/video untested.

Authors: We acknowledge this is a substantive concern. Although the manuscript already compares against prior multi-modal embedding methods, the absence of an explicit BERT text-only control does leave the incremental value of the audio/video views via DCCA less directly tested. We will add the BERT text-only baseline to all three datasets in the revised experimental section, together with the corresponding ablation on view contributions, so that the complementary signal can be quantified.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical claims on external benchmarks

The paper reports an empirical comparison of one-step and two-step DCCA for fusing BERT text embeddings with audio and video features on three sentiment datasets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are measured against external SOTA baselines and are therefore falsifiable outside the paper's own inputs. The reader's assessment of score 2.0 is consistent with the absence of any load-bearing internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that DCCA can extract useful shared structure across modalities and that the new debate dataset is representative.

pith-pipeline@v0.9.0 · 5689 in / 995 out tokens · 17724 ms · 2026-05-24T20:59:11.825781+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 3 internal anchors

[1] Introduction. Various social media platforms make available a variety of multi-modal content generated through expression of opinions and ideologies by social media users in the form of written commentary, podcasts, and lifestyle vlogs on a variety of topics such as politics, entertainment, reviews of movies, products etc. Multi-modal data enables one...

[2] Related Work. The idea of combining multi-modal text, audio and video features expressed in this paper is closest to that of [1] which encodes text, speech, and visual signals using a BiLSTM encoder, openSMILE, and 3D-CNN respectively. Encoded outputs are then concatenated and passed through a classifier. In contrast, this paper employs multi...

[3] Methods. This section briefly reviews Deep Canonical Correlation Analysis (DCCA) and outlines the methods used to obtain unimodal features. This section also outlines the procedure used to obtain the multi-modal embeddings used for experiments in Section 4. 3.1. Deep Canonical Correlation Analysis (DCCA). Classic Canonical Correlation Analysis (CCA) [14]...

[4] Henceforth, audio embeddings are denoted by $v_a$. Video Encoding: Framewise features from the video stream are extracted using a combination of FACET (https://imotions.com/facial-expressions/) and OpenFace 2.0 [16]. For each 10 second duration video, video-level feature vectors are obtained by averaging across the feature vectors corresponding to individu...

[5] Henceforth, video embeddings are denoted by $v_v$. 3.3. Methodology. DCCA accepts two views of data at a time and learns a correlated subspace. Since we are working with three views of data, all three views must be combined. We consider two different procedures. The One-Step DCCA concatenates the audio and video features and applies DCCA to this combined a...

[6] $\leftarrow \mathrm{DCCA}(v'_1, v'_2)$. 5: $\hat{v}'_1 = [\bar{v}'_1 \mid v'_1]$, $\hat{v}'_2 = [\bar{v}'_2 \mid v'_2]$. 6: Return $v_{\mathrm{multi}} = [\hat{v}'_1 \mid \hat{v}'_2]$. 3.3.3. Sentiment Classification. Multi-view embeddings obtained from One-Step DCCA and Two-Step DCCA are input to a logistic regression classifier to predict the sentiment label for test data sets in Section 4.

[7] Experiments. This section first describes the different test data sets used and the baseline methods that are evaluated against embeddings obtained from One-Step DCCA and Two-Step DCCA. Multi-modal embeddings obtained from DCCA methods are input to a logistic regression classifier and accuracy and F-scores on test data sets are reported as the performance...

[8] Frames scored (0, 3] are labeled as 'positive' and scores between [−3, 0) are labeled as 'negative'. There are a total of 17859 data points available for binary sentiment classification. Data is partitioned into predetermined train (12787 points), validation (3634 points) and test (1438 points) splits. Raw features for CMU-MOSI and CMU-MOSEI are obtaine...

[9] Discussions and Conclusions. The key issue is this: which is more important, text, speech, or video? Being able to process all three views of human discourse simultaneously allows consideration of the basic question of the relative contributions of the semantics, the spoken delivery, and the accompanying images. Conventional wisdom would be that the text...

[10] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh, and A. Hussain, “Multimodal sentiment analysis: Addressing key issues and setting up the baselines,” IEEE Intelligent Systems, vol. 33, no. 6, pp. 17–25, 2018.

[11] A. Hu and S. Flaxman, “Multimodal sentiment analysis to explore the structure of emotions,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 350–358.

[12] R. Arora and K. Livescu, “Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7135–7139.

[13] E. P. Bucy, J. M. Foley, J. Lukito, L. Doroshenko, D. V. Shah, J. Pevehouse, and C. Wells, “Performing populism: Trump's transgressive debate style and the dynamics of Twitter response,” New Media & Society, 2018.

[14] J. Joo, E. P. Bucy, and C. Seidel, “Automated coding of televised leader displays: Detecting nonverbal political behavior with computer vision,” in International Journal of Communication, in press.

[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[16] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos,” arXiv preprint arXiv:1606.06259, 2016.

[17] A. Zadeh, P. P. Liang, J. Vanbriesen, S. Poria, E. Cambria, M. Chen, and L.-P. Morency, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Association for Computational Linguistics (ACL), 2018.

[18] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International Conference on Machine Learning, 2013, pp. 1247–1255.

[19] S. H. Dumpala, I. Sheikh, R. Chakraborty, and S. K. Kopparapu, “Audio-visual fusion for sentiment classification using cross-modal autoencoder.”

[20] P. Rastogi, B. Van Durme, and R. Arora, “Multiview LSA: Representation learning via generalized CCA,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 556–566.

[21] P. K. Sarma, Y. Liang, and B. Sethares, “Domain adapted word embeddings for improved sentiment classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2018, pp. 37–42.

[22] A. Benton, R. Arora, and M. Dredze, “Learning multiview embeddings of Twitter users,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 14–19.

[23] H. Hotelling, “Relations between two sets of variates,” in Breakthroughs in Statistics. Springer, 1992, pp. 162–190.

[24] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, “COVAREP: A collaborative voice analysis repository for speech technologies,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 960–964.

[25] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “OpenFace 2.0: Facial behavior analysis toolkit,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 59–66.

[26] T. Baltrušaitis, M. Mahmoud, and P. Robinson, “Cross-dataset learning and person-specific normalisation for automatic action unit detection,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 6. IEEE, 2015, pp. 1–6.

[27] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, “Multi-attention recurrent network for human communication comprehension,” arXiv preprint arXiv:1802.00923, 2018.

[28] P. Horst, “Generalized canonical correlations and their applications to experimental data,” Journal of Clinical Psychology, 17(4), 1961.

[29] M. E. Grabe and E. P. Bucy, Image Bite Politics: News and the Visual Framing of Elections. Oxford University Press, 2009.
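
The extracted Methods and Experiments fragments in the list above describe three small data-preparation steps: averaging framewise video features into one video-level vector, mapping sentiment scores to binary labels, and using predetermined splits. A minimal standard-library sketch of those steps follows. The function names and the handling of a score of exactly 0 (left unlabeled, since the quoted intervals exclude it) are assumptions, and the tiny feature matrix in the usage lines is made up.

```python
from statistics import fmean

def video_level_features(frames):
    """Average framewise feature vectors into a single video-level vector."""
    return [fmean(column) for column in zip(*frames)]

def binary_label(score):
    """Map a sentiment score to a label using the quoted intervals."""
    if 0 < score <= 3:
        return "positive"
    if -3 <= score < 0:
        return "negative"
    return None  # a score of exactly 0 falls in neither quoted interval

# Split sizes as quoted in the extracted Experiments fragment.
SPLITS = {"train": 12787, "validation": 3634, "test": 1438}
TOTAL_POINTS = 17859

# Made-up two-frame example, for illustration only.
pooled = video_level_features([[1.0, 2.0], [3.0, 4.0]])
splits_are_consistent = sum(SPLITS.values()) == TOTAL_POINTS
```

The final check confirms that the three quoted split sizes add up to the quoted total of 17859 data points.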
