Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction

Bobak Farzin; Jeremy Howard; Piotr Czapla

arxiv: 1907.03187 · v1 · pith:KY6OGAILnew · submitted 2019-07-06 · 💻 cs.CL

Pith reviewed 2026-05-25 01:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords language model · humor detection · Spanish Twitter · transfer learning · label smoothing · HAHA challenge · text classification

The pith

A language model pre-trained from scratch on Spanish Twitter data transfers effectively to humor prediction, placing third in classification and second in regression for the HAHA 2019 challenge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pre-training a language model on a large Spanish Twitter corpus and transferring it to a humor detection task yields competitive results in a shared challenge. The system outperforms a Naive Bayes baseline while incorporating label smoothing to reduce the impact of noisy labels. A sympathetic reader would care because the work shows a practical way to adapt general language modeling techniques to social media humor analysis in Spanish without relying on English-centric resources.
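
To make the baseline comparison concrete, here is a minimal sketch of a bag-of-words Naive Bayes text classifier. The summary does not say which Naive Bayes variant or features the authors used, so the multinomial model, the add-one smoothing (`alpha`), the whitespace tokenizer, and the toy tweets below are all assumptions made for illustration, not the paper's baseline.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, text, label):
        # log P(label) + sum over known words of log P(word | label)
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda label: self.log_score(text, label))


# Hypothetical toy data: 1 = humorous, 0 = not humorous.
texts = [
    "jaja que chiste tan bueno",
    "jaja me muero de risa",
    "hoy llueve en la ciudad",
    "reunion de trabajo a las nueve",
]
labels = [1, 1, 0, 0]

model = NaiveBayes().fit(texts, labels)
print(model.predict("jaja que risa"))            # 1
print(model.predict("hoy llueve en la ciudad"))  # 0
```

A model this simple sees only word counts, which is why it serves as a useful floor against which a pre-trained language model can be measured.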

Core claim

We trained a language model from scratch on a large Twitter-based Spanish corpus and transferred that knowledge to our competition model for the HAHA 2019 Challenge, achieving 3rd place in the classification task and 2nd place in the regression task, while using label smoothing in the loss function to address inherent label errors.

What carries the argument

The argument rests on a language model pre-trained from scratch on Spanish Twitter text, whose learned weights are transferred to the downstream humor classification and regression tasks.

If this is right

The same pre-training plus fine-tuning pipeline can be applied to other Spanish social media classification tasks.
Label smoothing reduces overconfidence on crowdsourced humor labels and improves generalization.
The released code and model enable direct replication and extension by others on similar Twitter humor datasets.
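
The label-smoothing point above can be sketched in a few lines. This is a generic, framework-free version of uniform label smoothing, not the authors' implementation; the smoothing strength `eps=0.1` is an assumed value, since the summary does not report the one used in the paper.

```python
import math


def smoothed_targets(label, num_classes, eps=0.1):
    """Target distribution for one integer label under uniform label smoothing.

    The true class receives 1 - eps + eps / K and every other class eps / K,
    where K is the number of classes.
    """
    base = eps / num_classes
    return [base + (1.0 - eps if k == label else 0.0) for k in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothing_loss(logits, label, eps=0.1):
    """Cross-entropy between the smoothed target and the softmax of the logits."""
    log_probs = log_softmax(logits)
    target = smoothed_targets(label, len(logits), eps)
    return -sum(t * lp for t, lp in zip(target, log_probs))


# A very confident prediction for class 0 on a two-class problem.
confident = [8.0, -8.0]
hard = label_smoothing_loss(confident, label=0, eps=0.0)  # plain cross-entropy
soft = label_smoothing_loss(confident, label=0, eps=0.1)  # smoothed target
print(hard < soft)  # True
```

With `eps=0.0` the function reduces to ordinary cross-entropy. With `eps > 0` an extremely confident prediction is no longer rewarded with a near-zero loss, which is the mechanism by which smoothing discourages the model from fitting possibly mislabeled examples too tightly.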

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the Twitter corpus captures dialectal variation well, the method could extend to other regional Spanish varieties with minimal additional data.
The success against a simple baseline suggests that pre-training scale matters more than task-specific feature engineering for this domain.
Similar pre-training on other low-resource social media languages might close performance gaps with English systems.

Load-bearing premise

Pre-training a language model from scratch on a large Twitter corpus transfers useful knowledge to the humor prediction task, even though the task's labels are noisy and have to be softened with label smoothing.

What would settle it

A replication that trains the same downstream model from random initialization on the HAHA data alone and matches or exceeds the reported rankings would indicate the pre-training step adds little value.

Figures

Figures reproduced from arXiv: 1907.03187 by Bobak Farzin, Jeremy Howard, Piotr Czapla.

Original abstract

Our entry into the HAHA 2019 Challenge placed $3^{rd}$ in the classification task and $2^{nd}$ in the regression task. We describe our system and innovations, as well as comparing our results to a Naive Bayes baseline. A large Twitter based corpus allowed us to train a language model from scratch focused on Spanish and transfer that knowledge to our competition model. To overcome the inherent errors in some labels we reduce our class confidence with label smoothing in the loss function. All the code for our project is included in a GitHub repository for easy reference and to enable replication by others.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reports the authors' entry into the HAHA 2019 Challenge, which placed 3rd in the classification task and 2nd in the regression task. The system pre-trains a language model from scratch on a large Spanish Twitter corpus, transfers that knowledge to the humor prediction task, and applies label smoothing in the loss function to address label noise. Results are compared to a Naive Bayes baseline, and all code is released on GitHub.

Significance. If the reported competition rankings hold, the work provides a concrete demonstration of the value of domain-specific pre-training on Twitter data for Spanish social-media NLP tasks and the practical application of label smoothing for noisy supervision. The explicit release of replication code is a strength that supports verification and reuse.

minor comments (3)

[Abstract, Results] Abstract and results sections report only the final competition rankings without the underlying metrics (e.g., F1, accuracy, or RMSE values) achieved by the submitted system or the Naive Bayes baseline. Including these numbers would allow readers to assess the magnitude of improvement independently of the external ranking.
[Method] The description of the pre-training procedure, model architecture, and hyper-parameters is high-level; while the GitHub repository is referenced, key details (corpus size, training steps, smoothing parameter value) should be stated in the paper for self-contained reading.
[Results] No error analysis or qualitative examples of predictions are provided, which would help explain the sources of the reported performance.
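
For readers unfamiliar with the metrics named in the first comment, the following sketch shows how F1 (for the classification task) and RMSE (for the regression task) could be computed. The function names and the toy labels and scores are illustrative assumptions; none of the numbers come from the paper or the challenge.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up labels (1 = humorous) and made-up funniness scores.
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 0, 0, 1, 1]
scores_true = [3.0, 1.0, 2.0]
scores_pred = [2.0, 1.0, 4.0]

print(round(f1_score(labels_true, labels_pred), 4))  # 0.6667
print(round(rmse(scores_true, scores_pred), 4))      # 1.291
```

Reporting values like these for both the submitted system and the baseline would let readers judge the size of the gap directly, rather than inferring it from leaderboard position.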

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our HAHA 2019 submission. The referee summary accurately reflects the paper's content, and we appreciate the positive assessment of the domain-specific pre-training and code release. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical competition rankings (3rd classification, 2nd regression) achieved by a described system of Twitter LM pretraining, transfer learning, and label smoothing. No mathematical derivation, equations, or fitted-parameter predictions are present; the central claims are factual statements about external challenge results and are supported by a public GitHub repository. No self-citation chains, self-definitional steps, or reductions of outputs to inputs by construction exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities were identified beyond standard machine learning practice. The work is an empirical application report, and this ledger is based on the abstract.

pith-pipeline@v0.9.0 · 5622 in / 895 out tokens · 21244 ms · 2026-05-25T01:26:14.416740+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 8 internal anchors

[1] Bradbury, J., Merity, S., Xiong, C., Socher, R.: Quasi-recurrent neural networks. CoRR abs/1611.01576 (2016), http://arxiv.org/abs/1611.01576

[2] Castro, S., Chiruzzo, L., Rosá, A., Garat, D., Moncecchi, G.: A crowd-annotated Spanish corpus for humor analysis. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. pp. 7–11 (2018)

[3] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (Jun 2002), http://dl.acm.org/citation.cfm?id=1622407.1622416

[4] Chiruzzo, L., Castro, S., Etcheverry, M., Garat, D., Prada, J.J., Rosá, A.: Overview of HAHA at IberLEF 2019: Humor Analysis based on Human Annotation. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain (Sep 2019)

[5] Czapla, P., Howard, J., Kardas, M.: Universal language model fine-tuning with subword tokenization for Polish. CoRR abs/1810.10222 (2018), http://arxiv.org/abs/1810.10222

[6] Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146

[7] Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR abs/1808.06226 (2018), http://arxiv.org/abs/1808.06226

[8] Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. CoRR abs/1708.02182 (2017), http://arxiv.org/abs/1708.02182

[9] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.E.: Regularizing neural networks by penalizing confident output distributions. CoRR abs/1701.06548 (2017), http://arxiv.org/abs/1701.06548

[10] Press, O., Wolf, L.: Using the output embedding to improve language models. CoRR abs/1608.05859 (2016), http://arxiv.org/abs/1608.05859

[11] Smith, L.N.: A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820 (2018), http://arxiv.org/abs/1803.09820

[12] Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. pp. 90–94. ACL '12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2390665.2390688
