word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method

Omer Levy; Yoav Goldberg

arxiv: 1402.3722 · v1 · pith:2PHN6JN4new · submitted 2014-02-15 · cs.CL · cs.LG · stat.ML

keywords: mikolov, word2vec, behind, models, software, tomas, attempt, chen

The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.
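For orientation, the negative-sampling objective that the abstract refers to can be sketched for a single (word, context) pair as log σ(c · w) plus the sum of log σ(−c_neg · w) over k sampled negative contexts. The snippet below is a minimal illustration under stated assumptions, not the word2vec implementation: the function names are hypothetical, the vectors are plain Python lists, and the negative samples are assumed to have been drawn already (the noise distribution and the training loop are not shown).

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_objective(word_vec, context_vec, negative_vecs):
    """Objective for one observed (word, context) pair.

    Rewards a high score for the observed context and a low score
    for each already-sampled negative context (hypothetical helper).
    """
    positive_term = log_sigmoid(dot(context_vec, word_vec))
    negative_term = sum(log_sigmoid(-dot(neg, word_vec)) for neg in negative_vecs)
    return positive_term + negative_term

# Toy usage: one observed context and k = 2 sampled negatives.
w = [0.5, -0.2]
c = [0.4, 0.1]
negatives = [[-0.3, 0.8], [0.9, 0.9]]
print(negative_sampling_objective(w, c, negatives))
```

Training maximizes this quantity summed over all observed pairs, so the objective grows when the observed context aligns with the word vector and the sampled negatives do not.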

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Linear Representation Hypothesis and the Geometry of Large Language Models
cs.CL · 2023-11 · conditional · novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

Language Models as Knowledge Bases?
cs.CL · 2019-09 · accept · novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

Flatter is better: Percentile Transformations for Recommender Systems
cs.IR · 2019-07 · unverdicted · novelty 6.0

Percentile-based rating transformation flattens user rating distributions and improves recommendation ranking performance on four real-world datasets.

TIDE: Every Layer Knows the Token Beneath the Context
cs.CL · 2026-05 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs
cs.SE · 2019-07 · unverdicted · novelty 5.0

Manual analysis of 1,280 bug reports across three ecosystems produces a nine-category root cause taxonomy; an ML classifier achieves 64% F-Measure and 74% AUC-ROC overall.

Simplex2Vec embeddings for community detection in simplicial complexes
physics.soc-ph · 2019-06 · unverdicted · novelty 5.0

Simplex2Vec embeddings are used to compute and visualize community structures in simplicial complexes, with tests on synthetic data and applications to social and brain datasets showing benefits from higher-order inte...

Universal Time-Series Representation Learning: A Survey
cs.LG · 2024-01 · unverdicted · novelty 3.0

A survey that proposes a taxonomy for universal time-series representation learning and reviews existing deep learning studies along with experimental setups.

Neural or Statistical: An Empirical Study on Language Models for Chinese Input Recommendation on Mobile
cs.CL · 2019-07 · unverdicted · novelty 3.0

Empirical comparison on Chinese mobile input shows n-gram and neural models have complementary strengths, with their hybrid delivering significant gains over single approaches.