word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method
The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.
Forward citations
Cited by 8 Pith papers
- The Linear Representation Hypothesis and the Geometry of Large Language Models
  Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
- Language Models as Knowledge Bases?
  BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.
- Flatter is better: Percentile Transformations for Recommender Systems
  Percentile-based rating transformation flattens user rating distributions and improves recommendation ranking performance on four real-world datasets.
- TIDE: Every Layer Knows the Token Beneath the Context
  TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
- Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs
  Manual analysis of 1,280 bug reports across three ecosystems produces a nine-category root cause taxonomy; an ML classifier achieves 64% F-Measure and 74% AUC-ROC overall.
- Simplex2Vec embeddings for community detection in simplicial complexes
  Simplex2Vec embeddings are used to compute and visualize community structures in simplicial complexes, with tests on synthetic data and applications to social and brain datasets showing benefits from higher-order inte...
- Universal Time-Series Representation Learning: A Survey
  A survey that proposes a taxonomy for universal time-series representation learning and reviews existing deep learning studies along with experimental setups.
- Neural or Statistical: An Empirical Study on Language Models for Chinese Input Recommendation on Mobile
  Empirical comparison on Chinese mobile input shows n-gram and neural models have complementary strengths, with their hybrid delivering significant gains over single approaches.