Universal Language Model Fine-Tuning with Subword Tokenization for Polish

Jeremy Howard; Marcin Kardas; Piotr Czapla

arxiv: 1810.10222 · v1 · pith:Z22J7HCRnew · submitted 2018-10-24 · 💻 cs.CL · cs.LG · stat.ML

classification cs.CL · cs.LG · stat.ML

keywords model · language · fine-tuning · first · polish · results · subword · tokenization

Universal Language Model for Fine-tuning [arXiv:1801.06146] (ULMFiT) is one of the first NLP methods for efficient inductive transfer learning. Unsupervised pretraining results in improvements on many NLP tasks for English. In this paper, we describe a new method that uses subword tokenization to adapt ULMFiT to languages with high inflection. Our approach results in a new state-of-the-art for the Polish language, taking first place in Task 3 of PolEval'18. After further training, our final model outperformed the second best model by 35%. We have open-sourced our pretrained models and code.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction
cs.CL · 2019-07 · unverdicted · novelty 3.0

A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.