Distributed Representations of Sentences and Documents

Quoc V. Le; Tomas Mikolov

arXiv: 1405.4053 · v2 · pith:W67U2AU7new · submitted 2014-05-16 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL · cs.AI · cs.LG

keywords bag-of-words · algorithm · fixed-length · representations · vector · words · document · documents

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
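The two bag-of-words weaknesses named in the abstract can be illustrated with a minimal sketch. The example sentences, the helper names `bag_of_words` and `euclidean`, and the use of one-hot word vectors with Euclidean distance are assumptions chosen for illustration; they are not taken from the paper, and the sketch does not implement Paragraph Vector itself.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (hypothetical helper)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Weakness 1: word order is lost.
# Two sentences with opposite meanings share the same word counts.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"
vocab = sorted(set(sentence_a.split()) | set(sentence_b.split()))
bow_a = bag_of_words(sentence_a, vocab)
bow_b = bag_of_words(sentence_b, vocab)
same_vector = bow_a == bow_b

# Weakness 2: word semantics are ignored.
# With one dimension per word, every pair of distinct words is equally far apart.
words = ["powerful", "strong", "paris"]
one_hot = {
    word: [1 if i == j else 0 for j in range(len(words))]
    for i, word in enumerate(words)
}
distances = {
    (u, v): euclidean(one_hot[u], one_hot[v])
    for u, v in combinations(words, 2)
}

print("identical bag-of-words vectors:", same_vector)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Under these assumptions, the two sentences map to the same count vector, and "powerful" is exactly as far from "strong" as it is from "paris". A learned dense representation is what the abstract proposes to avoid both effects.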

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Computation Time for Recurrent Neural Networks
cs.NE · 2016-03 · accept · novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

Language Models as Knowledge Bases?
cs.CL · 2019-09 · accept · novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

Medical Concept Representation Learning from Claims Data and Application to Health Plan Payment Risk Adjustment
cs.LG · 2019-07 · unverdicted · novelty 4.0

Embedding models trained on medical codes outperform a commercial linear regression risk adjustment model for prospective risk score prediction.

OCC: A Smart Reply System for Efficient In-App Communications
cs.CL · 2019-07 · unverdicted · novelty 3.0

Uber's OCC system uses unsupervised embeddings plus nearest-neighbor intent detection followed by historical reply retrieval, reporting 76% intent accuracy and 71% production usage in English-speaking countries.