Efficient Estimation of Word Representations in Vector Space

Greg Corrado; Jeffrey Dean; Kai Chen; Tomas Mikolov

arxiv: 1301.3781 · v3 · submitted 2013-01-16 · 💻 cs.CL

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords word vectors · vector space models · neural networks · skip-gram · continuous bag-of-words · word similarity · syntactic analogies · semantic relationships

The pith

Two new neural network architectures learn continuous vector representations of words from massive text data with higher accuracy and far lower training cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two model architectures for learning word vectors from very large datasets. These architectures are evaluated on word similarity tasks and compared against earlier neural network techniques. They achieve large gains in accuracy while training high-quality vectors on a 1.6 billion word corpus in less than a day. This matters for applications that rely on representations capturing syntactic and semantic word relationships.

Core claim

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

What carries the argument

The continuous bag-of-words (CBOW) and skip-gram architectures: shallow neural networks that derive dense vector representations by predicting the target word from its context (CBOW) or the surrounding words from the target word (skip-gram).
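
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The following is a minimal sketch, not the authors' implementation: the function names `context_of`, `cbow_examples` and `skipgram_examples` and the `window` parameter are hypothetical, and the actual models add an output layer and other details omitted here.

```python
from typing import List, Tuple


def context_of(tokens: List[str], i: int, window: int) -> List[str]:
    """Words within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW: the whole context is the input, the centre word is the label."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-gram: the centre word is the input, each context word is a label."""
    return [
        (tokens[i], c)
        for i in range(len(tokens))
        for c in context_of(tokens, i, window)
    ]


sentence = "the quick brown fox".split()
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

CBOW yields one example per position, while skip-gram yields one example per (centre, neighbour) pair, so skip-gram sees more training examples for the same text and window.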

Load-bearing premise

That performance on the chosen word similarity and analogy test sets reliably indicates that the vectors capture general syntactic and semantic relationships rather than dataset-specific patterns.

What would settle it

Training the models on the 1.6 billion word dataset and finding no accuracy gain on the syntactic and semantic test sets relative to prior neural network methods, or requiring substantially more computation time to reach comparable performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Mikolov et al. show two simple neural architectures that train word vectors on a 1.6B-word corpus in under a day while beating prior models on similarity and analogy tests.

The main takeaway is that CBOW and Skip-gram give a practical way to get dense word vectors from web-scale text with clear speed and accuracy gains over the neural language models they compare against. They train on 1.6 billion words in less than a day and report better numbers on word similarity plus a new analogy task for syntax and semantics. That efficiency claim is the part that stands out, because earlier neural approaches were too slow for that scale. The architectures themselves are straightforward extensions of existing ideas but packaged for large data, and the results show Skip-gram picking up more semantic relations while CBOW is faster.

Credit is due for the concrete training-time numbers and for releasing the models in a way that let others reproduce and build on them quickly.

The soft spot is the evaluation: all headline numbers come from the chosen similarity datasets and the authors' own analogy set. Those proxies are reasonable for the time, but nothing in the paper tests whether the vectors transfer to downstream tasks like tagging or parsing, so the claim of generally high-quality representations rests on how well those benchmarks predict real use. Hyperparameter details and exact baseline re-implementations are also light in the write-up, which makes it harder to rule out tuning effects.

This paper is aimed at anyone who needs scalable word representations for NLP work. A reader who wants to understand the shift from count-based to predictive embeddings will find the methods and trade-offs useful. It is worth sending to peer review because the empirical results are specific enough to check and the scaling improvement is real even if the broader interpretation needs more validation.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes two novel neural network architectures (Continuous Bag-of-Words and Skip-gram) for learning continuous vector representations of words from very large corpora. It evaluates these representations on word similarity tasks against prior neural methods and introduces an analogy task for syntactic and semantic relations, claiming substantially higher accuracy at far lower computational cost, including training high-quality vectors on a 1.6 billion word dataset in less than a day.

Significance. If the reported accuracy gains and training-time reductions hold under scrutiny, the work is significant for establishing practical, scalable methods to produce high-quality word embeddings. The efficiency stems from the architectural simplifications and use of hierarchical softmax, enabling training on billion-word scales that were previously prohibitive. This has provided a foundation for subsequent embedding techniques and downstream NLP improvements.
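
To make the source of the savings concrete, the sketch below compares rough per-example training costs. It assumes the per-example cost expressions given in the paper: Q = N×D + N×D×H + H×V for a feedforward NNLM with a full softmax, Q = N×D + D×log2(V) for CBOW, and Q = C×(D + D×log2(V)) for Skip-gram. The function names and the parameter values are illustrative assumptions, not the paper's experimental settings.

```python
import math


def nnlm_cost(n: int, d: int, h: int, v: int) -> float:
    """Feedforward NNLM with a full softmax: projection + hidden + output."""
    return n * d + n * d * h + h * v


def cbow_cost(n: int, d: int, v: int) -> float:
    """CBOW with hierarchical softmax: no hidden layer, log2(V) output cost."""
    return n * d + d * math.log2(v)


def skipgram_cost(c: int, d: int, v: int) -> float:
    """Skip-gram with hierarchical softmax; c is the maximum context distance."""
    return c * (d + d * math.log2(v))


# Illustrative values only (assumptions, not taken from the paper's tables).
N, D, H, V, C = 10, 500, 500, 1_000_000, 10
for name, q in [
    ("NNLM", nnlm_cost(N, D, H, V)),
    ("CBOW", cbow_cost(N, D, V)),
    ("Skip-gram", skipgram_cost(C, D, V)),
]:
    print(f"{name:10s} ~{q:,.0f} operations per training example")
```

With values like these the H×V output term dominates the NNLM cost, which is the term that hierarchical softmax and the removal of the hidden layer are meant to avoid.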

major comments (2)

§4 (Experimental results): The central efficiency claim rests on the reported training time (<1 day on 1.6B words) and accuracy improvements versus prior neural baselines, but the section provides insufficient detail on exact baseline re-implementations, hyperparameter search procedures, and whether the same hardware/resources were used for all methods. This makes it difficult to confirm the comparisons are free of post-hoc tuning.
§4.2 (Evaluation on word analogy task): The state-of-the-art claim is made on a test set introduced by the authors themselves. While the task is a useful contribution, the manuscript does not include results on independent downstream tasks (e.g., named entity recognition or machine translation) or cross-corpus validation to support the broader interpretation that the vectors capture general syntactic and semantic relationships.
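
For readers unfamiliar with the analogy task, the evaluation idea can be sketched in a few lines: answer "a is to b as c is to ?" by finding the word whose vector is closest, by cosine similarity, to vec(b) - vec(a) + vec(c). The two-dimensional vectors below are made up for illustration, and `solve_analogy` is a hypothetical helper, not the authors' evaluation script.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors: Dict[str, List[float]], a: str, b: str, c: str) -> str:
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "maleness".
toy = {
    "king": [0.9, 0.9], "queen": [0.9, 0.1],
    "man": [0.1, 0.9], "woman": [0.1, 0.1],
    "apple": [-0.5, 0.3],
}
print(solve_analogy(toy, "man", "king", "woman"))
```

Excluding the three query words from the candidates matters in practice, because the nearest vector to the target is often one of the inputs themselves.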

minor comments (3)

§2 (Model architectures): The notation for the input/output layers and context window could be clarified with an explicit equation for the CBOW averaging operation to avoid ambiguity in implementation.
Table 1 and Figure 2: The reported accuracy numbers and training times would benefit from error bars or multiple runs to indicate variability, especially given the stochastic nature of the training.
References: The comparison to prior work (e.g., neural language models by Bengio et al.) could include a more explicit discussion of why the proposed models avoid the computational bottlenecks of those approaches.
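
One way to write the averaging step that the first minor comment asks for is sketched below. The notation is an assumption introduced here, not taken from the manuscript: $v_w$ is the input (projection) vector of word $w$, $c$ is the number of context words on each side, and $h_t$ is the averaged representation used to predict the word at position $t$.

```latex
h_t = \frac{1}{2c} \sum_{\substack{-c \le j \le c \\ j \ne 0}} v_{w_{t+j}}
```

Because the sum ignores word order, the context is treated as a bag of words, which is where the architecture gets its name.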

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments. We address each major comment below and will make the indicated revisions to improve clarity and transparency.

Referee: §4 (Experimental results): The central efficiency claim rests on the reported training time (<1 day on 1.6B words) and accuracy improvements versus prior neural baselines, but the section provides insufficient detail on exact baseline re-implementations, hyperparameter search procedures, and whether the same hardware/resources were used for all methods. This makes it difficult to confirm the comparisons are free of post-hoc tuning.

Authors: We agree that greater detail on the experimental setup would strengthen the comparisons. In the revised manuscript we will expand §4 with additional information on the re-implementations of the prior neural baselines, the hyperparameter ranges explored for each method, and explicit confirmation that all timing and accuracy measurements were performed under comparable hardware and resource constraints.

revision: yes

Referee: §4.2 (Evaluation on word analogy task): The state-of-the-art claim is made on a test set introduced by the authors themselves. While the task is a useful contribution, the manuscript does not include results on independent downstream tasks (e.g., named entity recognition or machine translation) or cross-corpus validation to support the broader interpretation that the vectors capture general syntactic and semantic relationships.

Authors: The analogy task was introduced in this work precisely to probe syntactic and semantic relations in a controlled manner. While we recognize that evaluations on downstream tasks would provide further support, the scope of the paper centers on efficient learning of high-quality vectors and direct assessment via the new task. We will add a short discussion in the revised version acknowledging this limitation and outlining how the vectors could be applied to downstream problems.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical claims rest on independent external benchmarks

The paper specifies the CBOW and Skip-gram architectures and their per-example training complexity (Eqs. 4-5), trains them on raw text corpora, then measures vector quality on a newly constructed analogy test set and on the Microsoft Research Sentence Completion Challenge. These evaluation sets are not constructed from the fitted parameters or training objective, nor do any central claims reduce to self-citation or renaming of inputs. The reported accuracy gains and computational savings are direct empirical outcomes measured against these test sets and prior published results, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard neural-network training assumptions plus the empirical claim that similarity-task performance measures semantic quality. No new physical entities or ad-hoc constants beyond ordinary hyperparameters are introduced.

free parameters (2)

vector dimensionality
Hyperparameter chosen by the authors; value not stated in abstract.
context window size
Hyperparameter controlling how many surrounding words are used.

axioms (2)

domain assumption Training a shallow network by back-propagation on a context-prediction task (predicting a word from its neighbours, or the neighbours from a word) produces useful word vectors.
Invoked implicitly by proposing the architectures.
domain assumption Word similarity and analogy test sets are valid proxies for syntactic and semantic understanding.
Used to claim state-of-the-art performance.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
REALM: Retrieval-Augmented Language Model Pre-Training
cs.CL 2020-02 accept novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
Intriguing properties of neural networks
cs.CV 2013-12 accept novelty 8.0

Deep neural networks exhibit distributed high-level semantic representations and discontinuous input-output mappings vulnerable to transferable adversarial perturbations.
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Differentially Private Sampling from Distributions via Wasserstein Projection
stat.ML 2026-05 unverdicted novelty 7.0

Proposes Wasserstein Projection Mechanism for differentially private sampling that optimizes Wasserstein distance utility and provides convergence guarantees for approximate computation.
OZ-TAL: Online Zero-Shot Temporal Action Localization
cs.CV 2026-05 unverdicted novelty 7.0

Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies
cs.SI 2026-05 unverdicted novelty 7.0

Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
cs.CV 2026-05 unverdicted novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
Expressiveness Limits of Autoregressive Semantic ID Generation in Generative Recommendation
cs.IR 2026-05 unverdicted novelty 7.0

Autoregressive semantic ID generation creates tree-induced probability correlations that prevent generative recommenders from capturing simple patterns; Latte adds latent tokens to relax these correlations.
Rational Communication Shapes Morphological Composition
cs.CL 2026-05 unverdicted novelty 7.0

Using historical corpora and the Rational Speech Act framework, attested English morphological compositions are ranked higher than plausible alternatives from the same time period when both semantic recoverability and...
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
cs.CL 2026-05 unverdicted novelty 7.0

EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
Identifying and Characterizing Semantic Clones of Solidity Functions
cs.SE 2026-04 unverdicted novelty 7.0

A code-and-comment analysis method detects semantic clones in Solidity functions with 59% overall precision (84% for same-name functions) and 97% recall on 300k contracts, plus LLM summaries for uncommented code.
Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training
cs.LG 2026-04 unverdicted novelty 7.0

TabGRAA enables self-improving tabular language models through iterative group-relative advantage alignment using modular automated quality signals like distinguishability classifiers.
Beyond Nodes vs. Edges: A Multi-View Fusion Framework for Provenance-Based Intrusion Detection
cs.CR 2026-04 unverdicted novelty 7.0

PROVFUSION fuses three complementary views of provenance data with lightweight schemes and voting to achieve higher detection accuracy and lower false positives than node- or edge-only baselines on nine benchmarks.
Learning to Discover at Test Time
cs.LG 2026-01 unverdicted novelty 7.0

TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
GRAB: A Risk Taxonomy--Grounded Benchmark for Unsupervised Topic Discovery in Financial Disclosures
cs.CL 2025-09 unverdicted novelty 7.0

GRAB is a benchmark dataset of 1.61M sentences from 8,247 10-K filings with taxonomy-anchored weak supervision labels for standardized evaluation of unsupervised topic models on financial risk disclosures.
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends
cs.CR 2025-08 accept novelty 7.0

A survey of LLM copyright protection that unifies text watermarking, model watermarking, and model fingerprinting while presenting new coverage of fingerprint transfer and removal.
Adversarial Video Promotion Against Text-to-Video Retrieval
cs.CV 2025-08 unverdicted novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
cs.CV 2024-10 conditional novelty 7.0

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distributi...
A Simple Framework for Contrastive Learning of Visual Representations
cs.LG 2020-02 accept novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
cs.CL 2019-10 accept novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Language Models as Knowledge Bases?
cs.CL 2019-09 accept novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.
Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation
cs.CL 2019-07 unverdicted novelty 7.0

The paper releases the first multimodal English-Hindi machine translation dataset of 31,525 segments with images and a challenge test set of 1,400 segments selected via embedding similarity for image-resolvable ambiguities.
Tight Sensitivity Bounds For Smaller Coresets
cs.LG 2019-07 unverdicted novelty 7.0

New algorithms compute provably tight sensitivity bounds for matrix rows, yielding smaller coresets for LMS approximation of affine k-subspaces via an iterative exact method and a dimensionality-reduction trick.
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
cs.CL 2026-05 conditional novelty 6.0

LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
cs.CL 2026-05 conditional novelty 6.0

DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
cs.CL 2026-05 unverdicted novelty 6.0

MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
cs.OS 2026-05 unverdicted novelty 6.0

PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
Multi-agent AI systems outperform human teams in creativity
cs.CL 2026-05 unverdicted novelty 6.0

Multi-agent LLM teams outperform human teams in creativity (d=1.50) across tasks by producing more novel ideas, with distinct semantic exploration patterns predicting success for each group.
Polar probe linearly decodes semantic structures from LLMs
cs.CL 2026-05 unverdicted novelty 6.0

LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
cs.LG 2026-05 unverdicted novelty 6.0

Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...
Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
cs.CL 2026-05 unverdicted novelty 6.0

Fixed 16-bit binary token codes can replace trainable input embeddings in 32-layer decoder-only models while maintaining comparable held-out perplexity on 17B tokens.
Semantic Smoothing for Language Models via Distribution Estimation and Embeddings
cs.IT 2026-05 conditional novelty 6.0

Semantic smoothing formulates next-word distribution estimation under KL loss with embedding-based KL-proximity side information, yielding an interpolation estimator with worst-case risk O(min{Δ, d/n}) that empiricall...
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
cs.CV 2026-05 unverdicted novelty 6.0

TAS-LoRA attaches a mixture of LoRA experts to a supernet and uses a dynamic router plus group-wise initialization to let different architecture subnets learn distinct features, yielding higher accuracy than prior TAS...
Query-efficient model evaluation using cached responses
cs.LG 2026-05 unverdicted novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks
cs.LG 2026-05 unverdicted novelty 6.0

Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.
When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
cs.DL 2026-05 unverdicted novelty 6.0

Post-2015 AI adoption in science grew exponentially across domains but stayed limited to CS-linked topics, carried citation premiums, higher retractions, and showed rising Asian middle-income country involvement.
When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
cs.DL 2026-05 unverdicted novelty 6.0

AI adoption in science has shown exponential growth since 2015 across domains but stays confined to few CS-linked topics, carries citation premiums, higher retraction rates, and uneven geographic spread, leaving its t...
When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
cs.DL 2026-05 unverdicted novelty 6.0

AI use in science has grown exponentially since 2015 but stays confined to computer science and statistics topics, shows higher retraction rates and citations, and follows distinct global adoption patterns.
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks
cs.LG 2026-05 unverdicted novelty 6.0

A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...
Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch
cs.DS 2026-05 unverdicted novelty 6.0

Triplet constraints realizable in D-dimensional Euclidean space cannot be preserved above 50% accuracy by any embedding of dimension at most cD for constant c<1, with UGC-hardness preventing better polynomial-time sol...
Deep Kernel Learning for Stratifying Glaucoma Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current ...
The TEA Nets framework combines AI and cognitive network science to model targets, events and actors in text
cs.AI 2026-04 unverdicted novelty 6.0

TEA Nets extracts agents, events, and targets from text to reveal emotional and semantic patterns in conspiracy theories and psychotherapy transcripts from humans and LLMs.
ImproBR: Bug Report Improver Using LLMs
cs.SE 2026-04 unverdicted novelty 6.0

ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
Self-supervised pretraining for an iterative image size agnostic vision transformer
cs.CV 2026-04 unverdicted novelty 6.0

A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
Context-Aware Search and Retrieval Under Token Erasure
cs.IR 2026-04 unverdicted novelty 6.0

Assigning higher redundancy to semantically important query features reduces retrieval error probability under token erasures, via multivariate Gaussian approximations of similarity margins and supporting numerical results.
Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
cs.CV 2026-04 unverdicted novelty 6.0

Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.
Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
cs.CL 2026-04 unverdicted novelty 6.0

LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
cs.CL 2026-04 unverdicted novelty 6.0

REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
cs.CV 2026-04 unverdicted novelty 6.0

SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
cs.SE 2026-04 unverdicted novelty 6.0

AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
Detecting RAG Advertisements Across Advertising Styles
cs.IR 2026-03 unverdicted novelty 6.0

Entity recognition models detect ads in RAG responses effectively and stay robust when advertisers switch styles, while lightweight models like random forests and SVMs become brittle under the same changes.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 147 Pith papers

[1] Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
[2] Y. Bengio, Y. LeCun. Scaling learning algorithms towards AI. In: Large-Scale Kernel Machines, MIT Press, 2007.
[3] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 2007.
[4] R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In International Conference on Machine Learning, ICML, 2008.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
[6] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang, A.Y. Ng. Large Scale Distributed Deep Networks. NIPS, 2012.
[7] J.C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[8] J. Elman. Finding Structure in Time. Cognitive Science, 14, 179-211, 1990.
[9] Eric H. Huang, R. Socher, C. D. Manning and Andrew Y. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. In: Proc. Association for Computational Linguistics, 2012.
[10] G.E. Hinton, J.L. McClelland, D.E. Rumelhart. Distributed representations. In: Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, 1986.
[11] D.A. Jurgens, S.M. Mohammad, P.D. Turney, K.J. Holyoak. Semeval-2012 task 2: Measuring degrees of relational similarity. In: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[12] A.L. Maas, R.E. Daly, P.T. Pham, D. Huang, A.Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of ACL, 2011.
[13] T. Mikolov. Language Modeling for Speech Recognition in Czech, Masters thesis, Brno University of Technology, 2007.
[14] T. Mikolov, J. Kopecký, L. Burget, O. Glembek and J. Černocký. Neural network based language models for highly inflective languages. In: Proc. ICASSP 2009.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur. Recurrent neural network based language model. In: Proceedings of Interspeech, 2010.
[16] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, S. Khudanpur. Extensions of recurrent neural network language model. In: Proceedings of ICASSP 2011.
[17] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, J. Černocký. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In: Proceedings of Interspeech, 2011.
(Paper footnote 4: The code is available at https://code.google.com/p/word2vec/)
[18]

Mikolov, A

T. Mikolov, A. Deoras, D. Povey, L. Burget, J. ˇCernock´y. Strategies for Training Large Scale Neural Network Language Models, In: Proc. Automatic Speech Recognition and Understand- ing, 2011

work page 2011
[19]

T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno Univer- sity of Technology, 2012

work page 2012
[20]

Mikolov, W.T

T. Mikolov, W.T. Yih, G. Zweig. Linguistic Regularities in Continuous Space Word Represen- tations. NAACL HLT 2013

work page 2013
[21]

Mikolov, I

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. Accepted to NIPS 2013

work page 2013
[22]

A. Mnih, G. Hinton. Three new graphical models for statistical language modelling. ICML, 2007

work page 2007
[23]

A. Mnih, G. Hinton. A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems 21, MIT Press, 2009

work page 2009
[24]

Mnih, Y .W

A. Mnih, Y .W. Teh. A fast and simple algorithm for training neural probabilistic language models. ICML, 2012

work page 2012
[25]

Morin, Y

F. Morin, Y . Bengio. Hierarchical Probabilistic Neural Network Language Model. AISTATS, 2005

work page 2005
[26]

D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by back- propagating errors. Nature, 323:533.536, 1986

work page 1986
[27]

H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007

work page 2007
[28]

Socher, E.H

R. Socher, E.H. Huang, J. Pennington, A.Y . Ng, and C.D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, 2011

work page 2011
[29]

Turian, L

J. Turian, L. Ratinov, Y . Bengio. Word Representations: A Simple and General Method for Semi-Supervised Learning. In: Proc. Association for Computational Linguistics, 2010

work page 2010
[30]

P. D. Turney. Measuring Semantic Similarity by Latent Relational Analysis. In: Proc. Interna- tional Joint Conference on Artiﬁcial Intelligence, 2005

work page 2005
[31]

Zhila, W.T

A. Zhila, W.T. Yih, C. Meek, G. Zweig, T. Mikolov. Combining Heterogeneous Models for Measuring Relational Similarity. NAACL HLT 2013

work page 2013
[32]

Zweig, C.J.C

G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft Research Technical Report MSR-TR-2011-129, 2011. 12

work page 2011
