Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some Practical Aspects

Carlos Soares; Eduarda Mendes Rodrigues; Eugénio Oliveira; Luís Sarmento; Pedro Saleiro

arxiv: 1709.00947 · v1 · pith:WRXQI3MRnew · submitted 2017-09-04 · 💻 cs.CL · cs.LG

Pedro Saleiro, Luís Sarmento, Eduarda Mendes Rodrigues, Carlos Soares, Eugénio Oliveira

keywords training size vocabulary evaluation examples intrinsic metrics words

This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focus on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up from a vocabulary of 2048 embedded words and 500K training examples to 32768 words and 10M training examples, while keeping a stable validation loss and an approximately linear trend in training time per epoch. We also observed that using fewer than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics suffer from over-sensitivity to their corresponding cosine similarity thresholds, indicating that a wider range of metrics needs to be developed to track progress.
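
To illustrate the threshold sensitivity mentioned in the abstract, the following is a minimal sketch and not the paper's actual metric. It assumes a hypothetical score, `fraction_above`, that counts how many related word pairs have cosine similarity at or above a threshold. The toy two-dimensional vectors and the word pairs are invented for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fraction_above(pairs, embeddings, threshold):
    """Hypothetical intrinsic score: share of word pairs whose cosine
    similarity is at or above the given threshold."""
    hits = sum(
        1 for a, b in pairs
        if cosine(embeddings[a], embeddings[b]) >= threshold
    )
    return hits / len(pairs)


# Toy embeddings (assumed values, not taken from the paper).
embeddings = {
    "bom": [1.0, 0.0],
    "boa": [0.8, 0.6],
    "otimo": [0.6, 0.8],
}

# Word pairs assumed to be semantically related.
pairs = [("bom", "boa"), ("bom", "otimo"), ("boa", "otimo")]

# The same embeddings score very differently as the threshold moves.
for threshold in (0.5, 0.7, 0.9):
    score = fraction_above(pairs, embeddings, threshold)
    print(f"threshold={threshold:.1f} score={score:.2f}")
```

In this toy setting the score changes at every threshold step even though the embeddings are fixed, which is the kind of behaviour that makes a single-threshold metric hard to compare across runs.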
