PL-MTEB: Polish Massive Text Embedding Benchmark

Michał Perełkiewicz; Rafał Poświata; Sławomir Dadas

arxiv: 2405.10138 · v2 · submitted 2024-05-16 · 💻 cs.CL

Pith reviewed 2026-05-24 00:41 UTC · model grok-4.3

keywords Polish language · text embeddings · benchmark · NLP evaluation · multilingual models · classification tasks · information retrieval · semantic similarity

The pith

PL-MTEB supplies 30 Polish-language tasks across five categories to evaluate text embedding models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Polish Massive Text Embedding Benchmark, built by extending an existing framework with 12 new tasks drawn from Polish datasets and two newly created datasets that yield four clustering tasks. The resulting collection covers classification, clustering, pair classification, information retrieval, and semantic text similarity. Thirty publicly available models, both Polish-specific and multilingual, were run on the full set of tasks. Results were broken down by task type and model size, and all datasets, evaluation code, and scores were released publicly.

Core claim

PL-MTEB is a benchmark of 30 diverse NLP tasks in Polish, formed by adding 12 tasks from existing resources and two new datasets that support four clustering tasks, allowing direct comparison of 30 embedding models on Polish data.

What carries the argument

The PL-MTEB benchmark, which standardizes evaluation across five task categories and supplies the added Polish datasets and tasks.

If this is right

Model rankings on Polish data can now be compared directly to rankings on the original MTEB tasks.
Performance differences between Polish-only and multilingual models become measurable on Polish-specific tasks.
Task-type and model-size breakdowns identify which embedding approaches work best for particular Polish use cases.
Public datasets and code enable other groups to add further Polish tasks or rerun evaluations on new models.
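
The task-type breakdown mentioned above can be sketched as a simple aggregation. This is a minimal illustration, not the authors' evaluation code: the model names, task names, and scores below are placeholders invented for the example, and the helper functions are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows: (model, task category, task name, main score).
# None of these names or numbers come from the paper.
SCORES = [
    ("model-a", "classification", "task-1", 0.60),
    ("model-a", "classification", "task-2", 0.70),
    ("model-a", "clustering", "task-3", 0.30),
    ("model-b", "classification", "task-1", 0.50),
    ("model-b", "classification", "task-2", 0.60),
    ("model-b", "clustering", "task-3", 0.40),
]


def average_by_category(rows):
    """Return {model: {category: mean score over that category's tasks}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, category, _task, score in rows:
        buckets[model][category].append(score)
    return {
        model: {cat: mean(vals) for cat, vals in cats.items()}
        for model, cats in buckets.items()
    }


def best_model_per_category(rows):
    """Return {category: model with the highest category average}."""
    averages = average_by_category(rows)
    categories = {cat for cats in averages.values() for cat in cats}
    return {
        cat: max(
            (m for m in averages if cat in averages[m]),
            key=lambda m: averages[m][cat],
        )
        for cat in categories
    }
```

Averaging within a category before comparing models keeps a category with many tasks from dominating the overall picture; a model-size breakdown would follow the same pattern with a size bucket in place of the category.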

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same method of adding language-specific tasks could be applied to create comparable benchmarks for other lower-resource languages.
If Polish-only models outperform multilingual ones on certain task categories, that pattern may guide choices for other Slavic languages.
The four new clustering tasks could be used to test whether embedding models preserve topic structure in Polish news or social media.
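
One way to check whether predicted clusters preserve topic structure is to score them against topic labels with V-measure, the harmonic mean of homogeneity and completeness. The sketch below assumes V-measure is the metric of interest, as in MTEB-style clustering tasks; the function names are illustrative and the code is not taken from the PL-MTEB repository.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _t), c in joint.items()
    )


def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_labels)
    h_clust = entropy(cluster_ids)
    # Homogeneity: each cluster contains members of a single topic.
    homogeneity = (
        1.0 if h_true == 0
        else 1 - conditional_entropy(true_labels, cluster_ids) / h_true
    )
    # Completeness: each topic's members land in a single cluster.
    completeness = (
        1.0 if h_clust == 0
        else 1 - conditional_entropy(cluster_ids, true_labels) / h_clust
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

A clustering that matches the topic labels up to renaming scores 1, while one that is statistically independent of the labels scores close to 0.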

Load-bearing premise

The 12 added tasks and two new datasets reflect typical Polish embedding use cases without annotation artifacts that would change model rankings.

What would settle it

Re-running the 30 models on the 12 new tasks and four new clustering tasks and obtaining model orderings that differ sharply from the orderings on the previously existing tasks.
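
How sharply two model orderings differ can be quantified with Spearman's rank correlation between per-model average scores on the two task groups. The following is a minimal sketch under that assumption; the paper does not prescribe this test, and the function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(scores_a, scores_b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(scores_a), ranks(scores_b))
```

Here `scores_a[i]` and `scores_b[i]` would hold the same model's average score on the new and the previously existing tasks, respectively. A value near 1 means the two task groups order the models almost identically; a value near 0 or below would indicate the sharp disagreement described above.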

Original abstract

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PL-MTEB adds 12 Polish tasks and two new datasets to MTEB with public code and data, but provides thin detail on how the new datasets were validated.

The main point is that the authors have extended MTEB to Polish by adding 12 tasks drawn from existing datasets plus two fresh datasets that yield four clustering tasks, then evaluated 30 models on the full set of 30 tasks and released everything publicly. This gives Polish NLP a ready-to-use evaluation suite that follows the established MTEB structure and includes both language-specific and multilingual models, with breakdowns by task type and model size. The public GitHub release of datasets, code, and results is the practical part that makes the work immediately usable for anyone testing embeddings on Polish text. The addition of original clustering data goes a step beyond simple reuse of prior resources.

The soft spot is the lack of information on dataset construction. The text does not report inter-annotator agreement for the new datasets or describe how task difficulty or balance was checked, so it is hard to assess from the paper alone whether the rankings could be distorted by annotation artifacts. That concern is real but contained; the rest of the work sticks to standard benchmark practices without hidden parameters or circular claims.

This paper is aimed at researchers who need Polish-specific embedding evaluations or who work on multilingual setups that include mid-resource languages. It is not a broad theoretical advance, yet the concrete artifacts and gap-filling nature make it worth sending to referees rather than desk-rejecting. I would recommend peer review.

Referee Report

0 major / 1 minor

Summary. The paper introduces the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish comprising 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. The authors add 12 new Polish-language tasks to MTEB based on existing datasets, prepare two new datasets yielding four clustering tasks, evaluate 30 publicly available embedding models (Polish and multilingual), analyze results by task type and model size, and release the datasets, evaluation code, and results publicly.

Significance. If the added tasks and new datasets are valid and representative, PL-MTEB will serve as a useful standardized resource for Polish text embedding evaluation, filling a gap in multilingual benchmarks. The public release of datasets, code, and results supports reproducibility and community use, which is a clear strength of the work.

minor comments (1)

[Abstract] The abstract states that two new datasets were prepared for four clustering tasks but provides no information on inter-annotator agreement or how task difficulty was balanced.
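
For readers unfamiliar with the statistic, inter-annotator agreement between two annotators is often reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic illustration and assumes two annotators labeled the same items; nothing in the source says how, or whether, the new datasets were annotated this way.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement and 0 when the annotators agree no more often than chance would predict.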

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of PL-MTEB, accurate summary of the contributions, and recommendation to accept. We are glad the work is viewed as a useful standardized resource for Polish text embedding evaluation.

Circularity Check

0 steps flagged

No significant circularity

The manuscript introduces and releases an empirical benchmark (PL-MTEB) consisting of 30 tasks drawn from existing Polish datasets plus two newly created datasets. It contains no derivations, equations, fitted parameters, or predictions that could reduce to their own inputs. All claims are statements of dataset construction, public release, and model evaluation results; these are externally verifiable and do not rely on self-referential definitions or self-citation chains for their validity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that existing English MTEB task templates can be faithfully translated and annotated for Polish without introducing systematic bias, plus the usual NLP premise that human-labeled datasets constitute valid ground truth.

axioms (1)

domain assumption: Existing English MTEB task definitions transfer to Polish with only language-specific data substitution.
Invoked when the authors state they added 12 new Polish-language tasks based on existing datasets.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
cs.CL · 2026-05 · unverdicted · novelty 5.0

ML-Embed releases open multilingual embedding models trained with a new 3D-ML framework that reportedly set new MTEB records on 9 of 17 benchmarks, especially in low-resource languages.