An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and two state-of-the-art document embedding methodologies. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.
Forward citations
Cited by 2 Pith papers
- No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection
  PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.
- Optimising for the long game: methodological challenges in energy system optimisation pathways
  A systematic review of energy system optimization pathways identifies foresight choices, end effects, resolution trade-offs, and investment dynamics as key methodological challenges and recommends improvements to avoi...