Distributed Representations of Words and Phrases and their Compositionality
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
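Two of the techniques named in the abstract, subsampling of frequent words and phrase finding, can be sketched in a few lines. The formulas are the ones given in the body of the paper rather than in the abstract shown here: a word with relative frequency f is discarded with probability 1 - sqrt(t / f), and an adjacent word pair is scored as (count(a b) - delta) / (count(a) * count(b)). The function names, the toy sentence, and the default values `t=1e-5` and `delta=1.0` are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def discard_probability(word_freq, t=1e-5):
    """Chance of dropping one occurrence of a word during subsampling.

    word_freq is the word's relative frequency in the corpus; t is the
    subsampling threshold (an assumed default for this sketch).
    """
    return max(0.0, 1.0 - math.sqrt(t / word_freq))


def phrase_scores(tokens, delta=1.0):
    """Score each adjacent pair: (count(a b) - delta) / (count(a) * count(b)).

    delta discounts pairs that occur only a handful of times.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


# Hypothetical toy corpus, lowercased and whitespace-tokenized.
tokens = "air canada flies to canada and air canada flies home".split()
scores = phrase_scores(tokens)

# Pairs scoring above a chosen threshold would be merged into one token.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
```

In this sketch, words at or below the threshold frequency are never discarded, while very frequent words are dropped most of the time. On a real corpus the scoring pass would typically be repeated so that longer phrases can form from pairs merged earlier.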
Forward citations
Cited by 18 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary
A multi-head attention model for Russian morphological tagging supports open dictionaries via subtoken splitting and reports 98-99% accuracy on grammatical categories while running efficiently on consumer hardware.
-
ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
-
Semantic knowledge guides innovation and drives cultural evolution
Semantic knowledge directs exploration to meaningful innovations, improves success rates, enables generalization, and synergizes with social learning to accelerate cumulative culture, as tested in agent-based models a...
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
-
Realised Volatility Forecasting: Machine Learning via Financial Word Embedding
News embeddings from financial text improve out-of-sample realized volatility forecasts for stocks, with stronger effects for stock-specific news and high-volatility periods, and yield gains when combined with benchmarks.
-
Toward General and Robust LLM-enhanced Text-attributed Graph Learning
UltraTAG organizes LLM-GNN methods for text-attributed graphs; UltraTAG-S adds LLM text propagation, augmentation, PageRank node selection, and edge reconfiguration to improve robustness on sparse data, with reported ...
-
Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation
BERT4ItemSeg reaches macro-F1 of 0.9825 on core 10-K items across 3,737 annotated reports, outperforming GPT4ItemSeg (0.9567) and baselines.
-
From Node2Vec to GPT-based GraphRAG: scientific impact prediction across graph and language models
Directed citation graphs plus textual embeddings reach 0.84-0.85 AUC for top-P% impact classification while GPT-5.5/5.4 Nano prompts hit 0.87 but show no consistent gain from retrieved graph neighborhoods over target-...
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
-
DeepTrax: Embedding Graphs of Financial Transactions
DeepTrax learns embeddings for accounts and merchants in financial transaction graphs via methods inspired by standard graph embedding techniques, reporting strong link prediction performance and utility in fraud dete...