PL-MTEB: Polish Massive Text Embedding Benchmark

2024 · cs.CL · arXiv 2405.10138

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.

representative citing papers

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

cs.CL · 2026-05-14 · unverdicted · novelty 5.0

ML-Embed releases open multilingual embedding models trained with a new 3D-ML framework that reportedly set new MTEB records on 9 of 17 benchmarks, especially in low-resource languages.

citing papers explorer

Showing 1 of 1 citing paper.

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

cs.CL · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
