
arxiv: 1907.03187 · v1 · pith:KY6OGAILnew · submitted 2019-07-06 · 💻 cs.CL

Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction

Pith reviewed 2026-05-25 01:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords: language model, humor detection, Spanish Twitter, transfer learning, label smoothing, HAHA challenge, text classification

The pith

A language model pre-trained from scratch on Spanish Twitter data transfers effectively to humor prediction, placing third in classification and second in regression for the HAHA 2019 challenge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pre-training a language model on a large Spanish Twitter corpus and transferring it to a humor detection task yields competitive results in a shared challenge. The system outperforms a Naive Bayes baseline while incorporating label smoothing to reduce the impact of noisy labels. A sympathetic reader would care because the work shows a practical way to adapt general language modeling techniques to social media humor analysis in Spanish without relying on English-centric resources.

Core claim

We trained a language model from scratch on a large Twitter-based Spanish corpus and transferred that knowledge to our competition model for the HAHA 2019 Challenge, achieving 3rd place in the classification task and 2nd place in the regression task, while using label smoothing in the loss function to address inherent label errors.

What carries the argument

The language model pre-trained from scratch on Spanish Twitter data, which supplies the knowledge transferred to the downstream humor classification and regression tasks.

If this is right

  • The same pre-training plus fine-tuning pipeline can be applied to other Spanish social media classification tasks.
  • Label smoothing reduces overconfidence on crowdsourced humor labels and improves generalization.
  • The released code and model enable direct replication and extension by others on similar Twitter humor datasets.
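
The label-smoothing point above can be made concrete with a minimal sketch. It assumes the common uniform formulation, in which a fraction epsilon of the target mass is spread evenly over all classes; the smoothing value of 0.1 and the function names are illustrative assumptions, not values or code taken from the paper.

```python
import math


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Build a smoothed target: (1 - epsilon) on the true class plus
    epsilon spread uniformly over all classes. epsilon=0.1 is an assumed value."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == target_index else uniform
        for i in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def smoothed_cross_entropy(logits, target_index, epsilon=0.1):
    """Cross-entropy between the smoothed target and the model's softmax output."""
    log_probs = log_softmax(logits)
    targets = smooth_labels(target_index, len(logits), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


if __name__ == "__main__":
    confident_logits = [0.0, 10.0]  # model is very sure of class 1
    print(smoothed_cross_entropy(confident_logits, 1, epsilon=0.0))
    print(smoothed_cross_entropy(confident_logits, 1, epsilon=0.1))
```

With epsilon set to 0 the function reduces to ordinary cross-entropy. With a positive epsilon, a very confident prediction is penalized even when it is correct, which is the mechanism that would limit overconfidence on noisy crowdsourced labels.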

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Twitter corpus captures dialectal variation well, the method could extend to other regional Spanish varieties with minimal additional data.
  • The success against a simple baseline suggests that pre-training scale matters more than task-specific feature engineering for this domain.
  • Similar pre-training on other low-resource social media languages might close performance gaps with English systems.

Load-bearing premise

A language model trained from scratch on a large Twitter corpus transfers effectively to the humor prediction task, and label smoothing adequately handles the label noise in that task.

What would settle it

A replication that trains the same downstream model from random initialization on the HAHA data alone and matches or exceeds the reported rankings would indicate the pre-training step adds little value.

Figures

Figures reproduced from arXiv: 1907.03187 by Bobak Farzin, Jeremy Howard, Piotr Czapla.

Figure 1: Histogram of F1 metric averaged across 5-fold metric
Original abstract

Our entry into the HAHA 2019 Challenge placed 3rd in the classification task and 2nd in the regression task. We describe our system and innovations, as well as comparing our results to a Naive Bayes baseline. A large Twitter based corpus allowed us to train a language model from scratch focused on Spanish and transfer that knowledge to our competition model. To overcome the inherent errors in some labels we reduce our class confidence with label smoothing in the loss function. All the code for our project is included in a GitHub repository for easy reference and to enable replication by others.
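
For readers unfamiliar with the kind of baseline the abstract mentions, the following is a rough sketch of a multinomial Naive Bayes text classifier. It is not the authors' baseline: the whitespace tokenization, add-one smoothing, function names, and toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Count tokens per class. alpha is the additive (Laplace) smoothing constant."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()  # assumed tokenization: lowercase + whitespace
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "token_counts": token_counts,
        "vocab": vocab,
        "alpha": alpha,
        "n_docs": len(labels),
    }


def predict(model, doc):
    """Return the class with the highest log prior plus summed token log-likelihoods."""
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_class in model["class_counts"].items():
        counts = model["token_counts"][label]
        total = sum(counts.values())
        score = math.log(n_class / model["n_docs"])
        for token in doc.lower().split():
            if token not in model["vocab"]:
                continue  # ignore tokens never seen in training
            score += math.log((counts[token] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy data: 1 = humorous, 0 = not humorous.
    docs = ["jaja que chiste", "jaja muy bueno chiste",
            "noticias del dia", "informe del tiempo"]
    labels = [1, 1, 0, 0]
    model = train_naive_bayes(docs, labels)
    print(predict(model, "jaja chiste"))
    print(predict(model, "informe del dia"))
```

A baseline like this uses only token counts from the labeled tweets, which is what makes it a useful contrast with a model that starts from a language model pre-trained on a much larger unlabeled corpus.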

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reports the authors' entry into the HAHA 2019 Challenge, which placed 3rd in the classification task and 2nd in the regression task. The system pre-trains a language model from scratch on a large Spanish Twitter corpus, transfers the knowledge to the humor prediction task, and applies label smoothing in the loss function to address label noise. Results are compared to a Naive Bayes baseline, and all code is released on GitHub.

Significance. If the reported competition rankings hold, the work provides a concrete demonstration of the value of domain-specific pre-training on Twitter data for Spanish social-media NLP tasks and the practical application of label smoothing for noisy supervision. The explicit release of replication code is a strength that supports verification and reuse.

minor comments (3)
  1. [Abstract, Results] Abstract and results sections report only the final competition rankings without the underlying metrics (e.g., F1, accuracy, or RMSE values) achieved by the submitted system or the Naive Bayes baseline. Including these numbers would allow readers to assess the magnitude of improvement independently of the external ranking.
  2. [Method] The description of the pre-training procedure, model architecture, and hyper-parameters is high-level; while the GitHub repository is referenced, key details (corpus size, training steps, smoothing parameter value) should be stated in the paper for self-contained reading.
  3. [Results] No error analysis or qualitative examples of predictions are provided, which would help explain the sources of the reported performance.
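
The metrics requested in the first comment can be computed as in the sketch below. It assumes binary humor labels with F1 measured on the positive class and RMSE measured on a numeric funniness score; the function names and sample values are illustrative, not taken from the paper or the challenge's scoring scripts.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """F1 on the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical labels and predictions, for illustration only.
    print(f1_score([1, 0, 1, 1], [1, 1, 0, 1]))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Reporting these values for both the submitted system and the baseline would let readers judge the size of the gap without relying on leaderboard position alone.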

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our HAHA 2019 submission. The referee summary accurately reflects the paper's content, and we appreciate the positive assessment of the domain-specific pre-training and code release. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper reports empirical competition rankings (3rd classification, 2nd regression) achieved by a described system of Twitter LM pretraining, transfer learning, and label smoothing. No mathematical derivation, equations, or fitted-parameter predictions are present; the central claims are factual statements about external challenge results and are supported by a public GitHub repository. No self-citation chains, self-definitional steps, or reductions of outputs to inputs by construction exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Judging from the abstract, the work is an empirical application report with no explicit free parameters, axioms, or invented entities beyond standard machine learning practice.

pith-pipeline@v0.9.0 · 5622 in / 895 out tokens · 21244 ms · 2026-05-25T01:26:14.416740+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Quasi-Recurrent Neural Networks

    Bradbury, J., Merity, S., Xiong, C., Socher, R.: Quasi-recurrent neural networks. CoRR abs/1611.01576 (2016), http://arxiv.org/abs/1611.01576

  2. [2]

    In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

    Castro, S., Chiruzzo, L., Rosá, A., Garat, D., Moncecchi, G.: A crowd-annotated Spanish corpus for humor analysis. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. pp. 7–11 (2018)

  3. [3]

    Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (Jun 2002), http://dl.acm.org/citation.cfm?id=1622407.1622416

  4. [4]

    In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)

    Chiruzzo, L., Castro, S., Etcheverry, M., Garat, D., Prada, J.J., Rosá, A.: Overview of HAHA at IberLEF 2019: Humor Analysis based on Human Annotation. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain (9 2019)

  5. [5]

    Universal Language Model Fine-Tuning with Subword Tokenization for Polish

    Czapla, P., Howard, J., Kardas, M.: Universal language model fine-tuning with subword tokenization for Polish. CoRR abs/1810.10222 (2018), http://arxiv.org/abs/1810.10222

  6. [6]

    Universal Language Model Fine-tuning for Text Classification

    Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146

  7. [7]

    Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Kudo, T., Richardson, J.: Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR abs/1808.06226 (2018), http://arxiv.org/abs/1808.06226

  8. [8]

    Regularizing and Optimizing LSTM Language Models

    Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. CoRR abs/1708.02182 (2017), http://arxiv.org/abs/1708.02182

  9. [9]

    Regularizing Neural Networks by Penalizing Confident Output Distributions

    Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.E.: Regularizing neural networks by penalizing confident output distributions. CoRR abs/1701.06548 (2017), http://arxiv.org/abs/1701.06548

  10. [10]

    Using the Output Embedding to Improve Language Models

    Press, O., Wolf, L.: Using the output embedding to improve language models. CoRR abs/1608.05859 (2016), http://arxiv.org/abs/1608.05859

  11. [11]

    A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

    Smith, L.N.: A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820 (2018), http://arxiv.org/abs/1803.09820

  12. [12]

    In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2

    Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. pp. 90–94. ACL ’12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2390665.2390688