EmotionX-HSU: Adopting Pre-trained BERT for Emotion Classification

Linkai Luo; Yue Wang

arxiv: 1907.09669 · v1 · pith:6TDTKLH4new · submitted 2019-07-23 · 💻 cs.CL

Pith reviewed 2026-05-24 18:00 UTC · model grok-4.3

keywords emotion classification · BERT · transfer learning · conversational emotion · Friends dataset · EmotionPush · fine-tuning

The pith

A pre-trained BERT, fine-tuned on in-domain data, encodes each utterance and assigns it one of four emotions, reaching micro-F1 scores of 79.1% on Friends and 86.2% on EmotionPush.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a BERT model pre-trained on large text corpora to the task of labeling each utterance with one of four emotions. It first converts every utterance into a sequence of vectors that capture its meaning, then passes those vectors through a softmax layer for classification. Because labeled conversational data are scarce, the authors transfer the general knowledge already present in BERT and fine-tune the model on the two target datasets. The resulting system reaches the reported micro-F1 scores and places third among eleven entries in the EmotionX-2019 shared task. A sympathetic reader cares because the work shows a concrete way to build usable emotion detectors when only modest amounts of in-domain labels are available.

Core claim

Encoding each utterance with BERT and then fine-tuning the model on the in-domain conversational data enables prediction of one of four emotions, measured by micro-F1 scores of 79.1 percent on the Friends test set and 86.2 percent on the EmotionPush test set.
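
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all scored labels before computing a single F1. The following is a minimal sketch, not the shared task's official scorer; the function name `micro_f1`, the label names, and the toy predictions are assumptions made for illustration.

```python
# Assumed label set; the text above only says "four emotions".
LABELS = {"neutral", "joy", "sadness", "anger"}


def micro_f1(gold, predicted, labels):
    """Micro-averaged F1: pool TP, FP, FN over `labels`, then compute F1 once."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g == p and g in labels:
            tp += 1
        else:
            if p in labels:
                fp += 1  # predicted a scored label that was not the gold one
            if g in labels:
                fn += 1  # missed a gold label that is scored
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
gold = ["joy", "neutral", "anger", "sadness", "neutral"]
predicted = ["joy", "neutral", "neutral", "sadness", "joy"]
score = micro_f1(gold, predicted, LABELS)
```

When every utterance carries exactly one label and all labels are scored, pooled precision and recall coincide with plain accuracy, which is what the toy data shows (3 of 5 correct).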

What carries the argument

BERT, the pre-trained bidirectional transformer that produces a sequence of vectors representing each utterance's meaning for input to the downstream softmax classifier.
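
The classification step can be sketched as a linear layer followed by a softmax. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the encoder output has already been reduced to one vector per utterance, and the label names, the 3-dimensional toy vector, and the weights are all hypothetical.

```python
import math

# Assumed label set; the text above only says "four emotions".
LABELS = ["neutral", "joy", "sadness", "anger"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(utterance_vector, weights, bias):
    """Apply a linear layer, then softmax; return (label, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, utterance_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy stand-in for an encoder output, plus hypothetical weights (4 classes x 3 dims).
vector = [0.5, -1.0, 2.0]
weights = [[0.1, 0.0, 0.2], [0.3, -0.5, 0.9], [-0.2, 0.4, 0.0], [0.0, 0.1, -0.3]]
bias = [0.0, 0.0, 0.0, 0.0]
label, probs = classify(vector, weights, bias)
```

In a fine-tuned system the weights and bias would be learned jointly with the encoder; here they are fixed by hand so the arithmetic can be followed.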

If this is right

The same two-step encoding-plus-softmax pipeline works for both scripted television dialogue and informal chat logs.
Fine-tuning allows the pre-trained model to adapt its general language knowledge to the specific emotion labels of the target datasets.
Competitive shared-task performance is achievable even when the amount of labeled conversational data is limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transfer approach would likely apply to other low-resource utterance classification tasks such as intent detection in customer-service logs.
Feeding preceding turns as additional context into the BERT encoder could raise accuracy beyond the single-utterance results reported here.
Pre-trained encoders reduce the volume of labeled examples needed for new conversational NLP tasks.
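
The idea of feeding preceding turns as context can be sketched as an input-construction step. The code below is a hypothetical illustration, not something the paper reports: the helper names, the window size, and the toy dialogue are assumptions, and the rendering only mimics BERT's two-segment `[CLS] A [SEP] B [SEP]` layout as plain text without running a tokenizer.

```python
def build_context_pairs(dialogue, window=2, joiner=" "):
    """Pair each utterance with up to `window` preceding turns joined as context."""
    pairs = []
    for i, utterance in enumerate(dialogue):
        context = joiner.join(dialogue[max(0, i - window):i])
        pairs.append((context, utterance))
    return pairs


def to_bert_style_text(context, utterance):
    """Render context and target in the [CLS] A [SEP] B [SEP] layout (plain text)."""
    if not context:
        return f"[CLS] {utterance} [SEP]"
    return f"[CLS] {context} [SEP] {utterance} [SEP]"


# Toy dialogue, invented for illustration only.
dialogue = ["Hi!", "You look upset.", "I lost my keys."]
pairs = build_context_pairs(dialogue, window=2)
rendered = [to_bert_style_text(c, u) for c, u in pairs]
```

The first utterance has no context and falls back to the single-segment form, which matches the single-utterance setting described above.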

Load-bearing premise

BERT's representations learned from general text transfer usefully to emotion classification in TV dialogues and Facebook chat logs without substantial domain mismatch.

What would settle it

Evaluating the identical fine-tuned model on a fresh conversational corpus drawn from a markedly different domain, such as customer-service transcripts, and obtaining micro-F1 scores substantially below 70 percent would show the transfer failed.

Figures

Figures reproduced from arXiv: 1907.09669 by Linkai Luo, Yue Wang.

Original abstract

This paper describes our approach to the EmotionX-2019, the shared task of SocialNLP 2019. To detect emotion for each utterance of two datasets from the TV show Friends and Facebook chat log EmotionPush, we propose two-step deep learning based methodology: (i) encode each of the utterance into a sequence of vectors that represent its meaning; and (ii) use a simply softmax classifier to predict one of the emotions amongst four candidates that an utterance may carry. Notice that the source of labeled utterances is not rich, we utilise a well-trained model, known as BERT, to transfer part of the knowledge learned from a large amount of corpus to our model. We then focus on fine-tuning our model until it well fits to the in-domain data. The performance of the proposed model is evaluated by micro-F1 scores, i.e., 79.1% and 86.2% for the testsets of Friends and EmotionPush, respectively. Our model ranks 3rd among 11 submissions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Standard BERT fine-tuning on two shared-task emotion datasets yields 79.1% and 86.2% micro-F1 with no controls to show pretraining helps.

This is a direct fine-tuning of BERT plus a softmax head on the Friends and EmotionPush datasets for the EmotionX-2019 shared task. The authors encode utterances, fine-tune on the limited labeled data, and report the two micro-F1 numbers that placed them third out of eleven submissions. That is the full contribution on the page. They do the obvious thing and deliver the requested scores on the held-out test sets. The description is clear enough about the two-step pipeline and the motivation for using a pre-trained model when labels are scarce. The numbers themselves are concrete empirical measurements rather than derived claims.

The soft spot is exactly the one in the stress-test note. Nothing isolates whether the pre-trained weights are responsible for the performance or whether a randomly initialized model with the same fine-tuning steps would reach similar results on these small conversational sets. No from-scratch baseline, no non-BERT comparator, and no ablation on the pretraining benefit appear in the work. Without those, the transfer story stays untested and domain mismatch remains possible.

The paper introduces no new method, derivation, or framework. It is a participation report. Readers who need the specific scores on these two datasets or who track shared-task leaderboards will find the numbers useful. Anyone looking for insight into BERT transfer, emotion classification techniques, or reproducible controls will not. I would not send this to peer review. It is a competent workshop-style submission but lacks the controls or novelty that would justify referee effort.

Referee Report

1 major / 2 minor

Summary. The paper proposes a two-step pipeline for the EmotionX-2019 shared task: BERT is used to encode each utterance into a sequence of vectors, followed by a simple softmax classifier to predict one of four emotions. The model is fine-tuned on the Friends and EmotionPush datasets and achieves micro-F1 scores of 79.1% and 86.2% on the respective test sets, ranking 3rd among 11 submissions.

Significance. If the central claim holds, the work demonstrates that fine-tuning a pre-trained BERT encoder yields competitive results on conversational emotion classification benchmarks with limited labeled data, providing a practical example of transfer learning for dialogue and social-media NLP tasks.

major comments (1)

[Experimental results / abstract] The reported micro-F1 scores of 79.1% (Friends) and 86.2% (EmotionPush) are presented as evidence that BERT pre-training transfers useful knowledge, yet no baseline with a randomly initialized encoder or from-scratch training on the identical architecture and splits is provided. This omission leaves the transfer-learning contribution unisolated and the central claim untested.
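
If such a control were run, the two systems' predictions on the same test split could be compared with a paired bootstrap. The sketch below is an assumption-laden illustration rather than part of the paper or the shared task: the function names and the toy labels are hypothetical, and it scores with accuracy, which coincides with micro-F1 only when each utterance has exactly one label and every label is scored.

```python
import random


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def paired_bootstrap_gap(gold, pred_a, pred_b, rounds=1000, seed=0):
    """Resample test items with replacement; return the observed score gap
    (system A minus system B) and an approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(gold)
    gaps = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        gaps.append(accuracy(g, a) - accuracy(g, b))
    gaps.sort()
    low = gaps[rounds * 25 // 1000]
    high = gaps[rounds * 975 // 1000 - 1]
    observed = accuracy(gold, pred_a) - accuracy(gold, pred_b)
    return observed, (low, high)


# Toy labels, invented for illustration; real use would pass the predictions
# of the pre-trained and the randomly initialized model on the same split.
gold = ["joy", "neutral", "anger", "sadness"]
same = paired_bootstrap_gap(gold, gold, gold, rounds=200)
```

An interval that excludes zero would suggest the gap between the two initializations is not an artifact of which test items happened to be sampled.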

minor comments (2)

[Abstract] 'a simply softmax classifier' contains a grammatical error and should read 'a simple softmax classifier'.
[Abstract] The phrasing 'until it well fits to the in-domain data' is awkward; revise for grammatical correctness and clarity (e.g., 'until it fits the in-domain data well').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and agree that the requested ablation would strengthen the paper.

Referee: [Experimental results / abstract] The reported micro-F1 scores of 79.1% (Friends) and 86.2% (EmotionPush) are presented as evidence that BERT pre-training transfers useful knowledge, yet no baseline with a randomly initialized encoder or from-scratch training on the identical architecture and splits is provided. This omission leaves the transfer-learning contribution unisolated and the central claim untested.

Authors: We acknowledge the validity of this observation. The manuscript presents the fine-tuned BERT results as evidence of effective transfer on limited conversational data and notes the competitive ranking, but does not include a random-initialization control on the same architecture and splits. While the benefit of BERT pre-training is supported by prior literature and our focus was on domain adaptation within the shared-task constraints, we agree that an explicit ablation would better isolate the contribution. We will add the requested baseline experiments (randomly initialized encoder + identical classifier and training protocol) to the revised manuscript and update the abstract and experimental section accordingly.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on external shared-task benchmarks

The manuscript describes a standard two-step procedure: encode utterances with pre-trained BERT, then apply a softmax classifier, followed by fine-tuning and reporting micro-F1 on the EmotionX-2019 test sets (79.1% Friends, 86.2% EmotionPush). These scores are direct empirical measurements on externally provided held-out data, not quantities derived from internal fits or self-referential definitions. The paper contains no equations, presents no parameter-fitting steps as predictions, and invokes no self-citations to establish uniqueness or ansatzes. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of BERT representations to the target emotion task, which is taken from the prior BERT paper rather than re-derived here.

axioms (1)

domain assumption: BERT pre-trained on a large general corpus provides useful representations that transfer to four-class emotion classification on the Friends and EmotionPush data.
Invoked when the authors state they utilise BERT to transfer knowledge because labeled data is not rich.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages...

[2] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[4] Chao-Chun Hsu, Sheng-Yeh Chen, Chuan-Chun Kuo, Ting-Hao K. Huang, and Lun-Wei Ku. EmotionLines: An emotion corpus of multi-party conversations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.

[5] Yanghoon Kim, Hwanhee Lee, and Kyomin Jung. AttnConvnet at SemEval-2018 Task 1: Attention-based convolutional neural networks for multi-label emotion classification. In Proceedings of The 12th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT, New Orleans, Louisiana, June 5-6, 2018, pages 141–145, 2018.

[6] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751, 2014.

[7] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2873–2879, 2016.

[8] Linkai Luo, Haiqing Yang, and Francis Y. L. Chin. EmotionX-DLC: Self-attentive BiLSTM for detecting sequential emotions in dialogues. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, SocialNLP@ACL 2018, Melbourne, Australia, July 20, 2018, pages 32–36, 2018.

[9, 10] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013.

[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.

[12] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Lo...

[13] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf, 2018.

[14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.

[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010, 2017.
