MTEB: Massive Text Embedding Benchmark

Loïc Magne; Niklas Muennighoff; Nils Reimers; Nouamane Tazi

arxiv: 2210.07316 · v3 · pith:Z52BXLJPnew · submitted 2022-10-13 · 💻 cs.CL · cs.IR · cs.LG

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

Pith reviewed 2026-05-15 10:11 UTC · model grok-4.3

classification 💻 cs.CL cs.IR cs.LG

keywords text embeddings, benchmark, evaluation, semantic textual similarity, clustering, reranking, multilingual, leaderboard

The pith

A new benchmark shows no single text embedding method performs best across all tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluations of text embeddings have long focused on narrow sets of datasets from one task, leaving unclear how well models transfer to other uses like clustering or reranking. The paper introduces MTEB as a broader test covering eight tasks, fifty-eight datasets, and one hundred twelve languages. Benchmarking thirty-three models on this suite reveals that performance rankings shift sharply depending on the task. This indicates the field has not settled on one embedding approach that scales to top results everywhere. The benchmark supplies open code and a public leaderboard to make future comparisons more consistent.

Core claim

The paper establishes the Massive Text Embedding Benchmark (MTEB) that spans eight embedding tasks across fifty-eight datasets and one hundred twelve languages. By evaluating thirty-three models on MTEB, the work finds that no particular text embedding method dominates across all tasks, which suggests the field has yet to converge on a universal text embedding method scaled sufficiently for state-of-the-art results on every embedding task.

What carries the argument

The Massive Text Embedding Benchmark (MTEB), a standardized collection of eight tasks and fifty-eight datasets that measures text embedding performance across diverse applications.

If this is right

Embedding models must be tested on multiple tasks instead of relying on semantic similarity alone.
Progress requires either new general methods or task-aware selection rather than one-size-fits-all scaling.
A public leaderboard will allow direct tracking of improvements across the full set of tasks.
Developers will need to weigh task-specific strengths when choosing an embedding for a given application.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Research groups may shift from single-task optimization to methods designed for balanced performance across the eight categories.
The benchmark could become a default check for any new embedding model before it is released.
Task-specific fine-tuning or routing mechanisms might emerge as practical ways to handle the observed specialization.

Load-bearing premise

The eight tasks and fifty-eight datasets chosen for MTEB represent the full range of real-world embedding applications so that scores on MTEB predict usefulness elsewhere.

What would settle it

A single new embedding model that ranks first on every one of the eight MTEB tasks at once, or a follow-up study showing that MTEB scores fail to predict performance in previously untested practical applications.
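
The first test above can be phrased as a simple check over a score table. The following is a minimal sketch, not the paper's evaluation code: the model names, task names, and scores are hypothetical placeholders, and the helper functions `best_per_task` and `dominant_model` are introduced here purely for illustration.

```python
from typing import Dict, Optional

# Hypothetical per-task scores (model -> task -> score).
# These numbers are placeholders, NOT results reported in the paper.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"clustering": 0.41, "reranking": 0.58, "sts": 0.80},
    "model_b": {"clustering": 0.45, "reranking": 0.55, "sts": 0.78},
    "model_c": {"clustering": 0.39, "reranking": 0.57, "sts": 0.82},
}


def best_per_task(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the top-scoring model for each task.

    Assumes every model has a score for every task. Ties are broken by
    dictionary order, which is a simplification.
    """
    tasks = sorted({task for per_task in scores.values() for task in per_task})
    return {task: max(scores, key=lambda m: scores[m][task]) for task in tasks}


def dominant_model(scores: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the model that ranks first on every task, or None if no such model exists."""
    winners = set(best_per_task(scores).values())
    return winners.pop() if len(winners) == 1 else None


if __name__ == "__main__":
    print(best_per_task(SCORES))
    print(dominant_model(SCORES))
```

With the placeholder table, each task has a different winner, so `dominant_model` returns `None`. A result other than `None` on the full set of task categories would correspond to the single dominant model described above.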

Original abstract

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces the Massive Text Embedding Benchmark (MTEB), spanning 8 tasks, 58 datasets, and 112 languages. By evaluating 33 models on this suite, the authors establish the most comprehensive text embedding benchmark to date and report that no single embedding method achieves top performance across all tasks.

Significance. If the reported results hold, MTEB supplies a standardized, multi-task evaluation resource that directly addresses the prior limitation of narrow, single-task assessments (e.g., STS-only). The open-source code, public leaderboard, and fully reproducible experimental setup constitute concrete strengths that enable community verification and incremental progress tracking.

minor comments (3)

§3.2: The criteria used to select the 58 datasets within each task are stated at a high level; adding a short paragraph or table listing the primary inclusion/exclusion rules would improve transparency without altering the central claim.
Table 2: The reported scores for the 33 models would benefit from an additional column or footnote indicating the number of runs or standard deviation, even if the main text already notes single-run evaluation.
Figure 3: The radar-chart comparison of top models is visually effective, but the legend ordering does not match the task order in the caption; reordering would reduce reader cross-referencing.
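
The run-count and standard-deviation annotation suggested for Table 2 could be produced along the following lines. This is a rough sketch under stated assumptions: the helper `summarize_runs` is hypothetical, the input scores are made up, and nothing here reflects how the authors computed their tables.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def summarize_runs(run_scores: Sequence[float]) -> Tuple[float, Optional[float], int]:
    """Return (mean, sample standard deviation, number of runs).

    The standard deviation is None for a single run, because it is
    undefined with fewer than two observations.
    """
    n = len(run_scores)
    if n == 0:
        raise ValueError("at least one run is required")
    spread = stdev(run_scores) if n > 1 else None
    return mean(run_scores), spread, n


if __name__ == "__main__":
    # Made-up scores from three hypothetical runs of one model on one dataset.
    print(summarize_runs([0.61, 0.63, 0.62]))
    # A single-run evaluation reports no spread.
    print(summarize_runs([0.62]))
```

Returning `None` for a single run, rather than zero, keeps a single-run score from being misread as a score with no variance.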

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We are pleased that the significance of MTEB as a standardized, multi-task benchmark for text embeddings is recognized, along with the value of the open-source code and public leaderboard.

Circularity Check

0 steps flagged

Pure empirical benchmark with no circular derivation

The paper introduces MTEB as a benchmark spanning 8 tasks and 58 datasets, evaluates 33 models, and reports that no single embedding method dominates all tasks. This finding is a direct empirical observation from external datasets and model performances, with no equations, fitted parameters, or self-citations forming a load-bearing derivation chain. The task selection is presented as a practical choice rather than derived from prior results in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper. It introduces no free parameters, no new axioms beyond standard assumptions about vector similarity, and no invented entities.

axioms (1)

standard math: Text embeddings can be meaningfully compared via cosine similarity or dot product on vector representations.
Invoked in the definition of the STS and retrieval tasks.
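
For concreteness, the two comparison functions named in this axiom can be written in a few lines. This is a minimal, dependency-free sketch for illustration; the function names are chosen here, and a real evaluation would more likely use a vectorized numerical library.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)


if __name__ == "__main__":
    a = [1.0, 2.0]
    b = [2.0, 4.0]  # same direction as a, twice the length
    print(dot(a, b))
    print(cosine_similarity(a, b))
```

The two measures differ in how they treat vector length: rescaling an embedding changes its dot product with other vectors but leaves cosine similarity unchanged, so the two can rank candidates differently unless embeddings are normalized.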

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
cs.LG 2026-05 unverdicted novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and ...
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
cs.CL 2026-05 unverdicted novelty 7.0

AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
Much of Geospatial Web Search Is Beyond Traditional GIS
cs.IR 2026-05 unverdicted novelty 7.0

Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
cs.IR 2026-05 unverdicted novelty 7.0

CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR 2026-04 unverdicted novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
cs.CR 2025-12 unverdicted novelty 7.0

DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker
cs.CL 2025-11 unverdicted novelty 7.0

UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-...
Representational Alignment Across Model Layers and Brain Regions with Multi-Level Optimal Transport
cs.LG 2025-10 accept novelty 7.0

Multi-Level Optimal Transport (MOT) jointly infers soft layer couplings and neuron transport plans to produce global alignment scores and structured hierarchical correspondences between networks of varying depths.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
cs.CL 2026-05 accept novelty 6.0

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
Sliced Inner Product Gromov-Wasserstein Distances
stat.ML 2026-05 unverdicted novelty 6.0

A sliced IGW distance is introduced with closed-form 1D expressions, rotational invariance, and studied structural and computational properties for efficient data alignment.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
cs.CL 2026-05 unverdicted novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
cs.CL 2026-04 unverdicted novelty 6.0

MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance especially at extreme lo...
JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
cs.IR 2026-04 accept novelty 6.0

JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
Semantic Data Processing with Holistic Data Understanding
cs.DB 2026-04 unverdicted novelty 6.0

HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 ...
Mitigating Membership Inference in Intermediate Representations with Differentially Private Training
cs.LG 2026-02 unverdicted novelty 6.0

LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass
cs.CL 2026-02 unverdicted novelty 6.0

SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
cs.CL 2025-12 unverdicted novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
cs.CL 2025-11 unverdicted novelty 6.0

LLM-MemCluster gives LLMs stateful memory and prompts that let them decide cluster count and iteratively refine groupings, outperforming baselines on benchmarks in a tuning-free end-to-end setup.
Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network
cs.CL 2025-10 unverdicted novelty 6.0

Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.
EmbeddingGemma: Powerful and Lightweight Text Representations
cs.CL 2025-09 unverdicted novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)
cs.CL 2025-09 unverdicted novelty 6.0

Verbalized algorithms integrate LLMs as oracles for simple string operations within classical algorithms to improve accuracy-runtime tradeoffs on sorting, clustering, submodular maximization, and multi-hop QA.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
StarCoder 2 and The Stack v2: The Next Generation
cs.SE 2024-02 accept novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
REPLUG: Retrieval-Augmented Black-Box Language Models
cs.CL 2023-01 conditional novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
CALMem: Application-Layer Dual Memory for Conversational AI
cs.IR 2026-05 unverdicted novelty 5.0

CALMem delivers virtually unbounded effective context for LLM conversations via an application-layer dual memory architecture with intra-session retrieval and token-adaptive injection.
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
cs.CL 2026-05 unverdicted novelty 5.0

Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
cs.SE 2026-05 conditional novelty 5.0

Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
cs.SE 2026-04 conditional novelty 5.0

STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction
cs.CL 2026-03 unverdicted novelty 5.0

A configurable pipeline turns text corpora into quantitative semantic signals via embeddings, logprobs, and UMAP-based noise reduction for document positioning and corpus profiling.
Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model
cs.IR 2026-02 unverdicted novelty 5.0

Qwen3-embedding models show noise sensitivity in conversational retrieval where dialogue artifacts rank highly despite lacking semantic value, a problem reduced by query prompting and more severe than in prior Qwen ve...
GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
cs.IR 2025-11 accept novelty 5.0

GovScape delivers multimodal search over 10 million government PDFs using metadata, exact text, semantic embeddings, and visual page features at an estimated $1,500 preprocessing cost.
Search-R3: Unifying Reasoning and Embedding in Large Language Models
cs.CL 2025-10 unverdicted novelty 5.0

Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Text Embeddings by Weakly-Supervised Contrastive Pre-training
cs.CL 2022-12 unverdicted novelty 5.0

E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
Domain-Adaptive Dense Retrieval for Brazilian Legal Search
cs.IR 2026-05 unverdicted novelty 4.0

Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with ...
Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
cs.IR 2025-07 unverdicted novelty 4.0

Lightweight federated learning with frozen embeddings and MLP heads reaches competitive micro and macro F1 scores for ICD-9 and ICD-10 coding on MIMIC-IV, nearly matching centralized training.
Query pipeline optimization for cancer patient question answering systems
cs.CL 2024-12 unverdicted novelty 4.0

Three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.