
arxiv: 2405.10138 · v2 · submitted 2024-05-16 · 💻 cs.CL

PL-MTEB: Polish Massive Text Embedding Benchmark

Pith reviewed 2026-05-24 00:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords Polish language · text embeddings · benchmark · NLP evaluation · multilingual models · classification tasks · information retrieval · semantic similarity

The pith

PL-MTEB supplies 30 Polish-language tasks across five categories to evaluate text embedding models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Polish Massive Text Embedding Benchmark, built by extending an existing framework with 12 new tasks drawn from Polish datasets and two newly created datasets that yield four clustering tasks. The resulting collection covers classification, clustering, pair classification, information retrieval, and semantic text similarity. Thirty publicly available models, both Polish-specific and multilingual, were run on the full set of tasks. Results were broken down by task type and model size, and all datasets, evaluation code, and scores were released publicly.

Core claim

PL-MTEB is a benchmark of 30 diverse NLP tasks in Polish, formed by adding 12 tasks from existing resources and two new datasets that support four clustering tasks, allowing direct comparison of 30 embedding models on Polish data.

What carries the argument

The PL-MTEB benchmark, which standardizes evaluation across five task categories and supplies the added Polish datasets and tasks.

If this is right

  • Model rankings on Polish data can now be compared directly to rankings on the original MTEB tasks.
  • Performance differences between Polish-only and multilingual models become measurable on Polish-specific tasks.
  • Task-type and model-size breakdowns identify which embedding approaches work best for particular Polish use cases.
  • Public datasets and code enable other groups to add further Polish tasks or rerun evaluations on new models.
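
A task-type breakdown of the kind described above can be sketched in a few lines. This is only an illustration, not the authors' evaluation code: the model names, task names, and scores are hypothetical placeholders, and the functions `category_averages` and `best_per_category` are invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores keyed by (category, task); NOT results from the paper.
SCORES = {
    "model_a": {
        ("classification", "task_1"): 0.50,
        ("classification", "task_2"): 0.75,
        ("clustering", "task_3"): 0.25,
        ("retrieval", "task_4"): 0.50,
    },
    "model_b": {
        ("classification", "task_1"): 0.75,
        ("classification", "task_2"): 0.75,
        ("clustering", "task_3"): 0.125,
        ("retrieval", "task_4"): 0.375,
    },
}


def category_averages(task_scores):
    """Average one model's per-task scores within each task category."""
    buckets = defaultdict(list)
    for (category, _task), score in task_scores.items():
        buckets[category].append(score)
    return {category: mean(values) for category, values in buckets.items()}


def best_per_category(scores):
    """Return the top-scoring model for each category by category average."""
    table = {model: category_averages(ts) for model, ts in scores.items()}
    categories = sorted({c for averages in table.values() for c in averages})
    return {
        c: max(table, key=lambda m: table[m].get(c, float("-inf")))
        for c in categories
    }


if __name__ == "__main__":
    for category, model in best_per_category(SCORES).items():
        print(f"{category}: {model}")
```

Averaging within a category before comparing models keeps categories with many tasks from dominating the comparison, which is one reasonable way to read a per-task-type breakdown.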

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same method of adding language-specific tasks could be applied to create comparable benchmarks for other lower-resource languages.
  • If Polish-only models outperform multilingual ones on certain task categories, that pattern may guide choices for other Slavic languages.
  • The four new clustering tasks could be used to test whether embedding models preserve topic structure in Polish news or social media.

Load-bearing premise

The 12 added tasks and two new datasets reflect typical Polish embedding use cases without annotation artifacts that would change model rankings.

What would settle it

Re-running the 30 models and obtaining model orderings on the newly added tasks (the 12 tasks built from existing datasets and the four clustering tasks built on the two new datasets) that differ sharply from the orderings on the remaining PL-MTEB tasks.
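
One way to quantify whether two model orderings "differ sharply" is a rank correlation such as Kendall's tau. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper: the function `kendall_tau` is written for this example, and the model names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a over the models present in both score dicts.

    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_x) & set(scores_y))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        dx = scores_x[a] - scores_x[b]
        dy = scores_y[a] - scores_y[b]
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical average scores per model on two task subsets; NOT paper results.
new_tasks = {"m1": 0.60, "m2": 0.50, "m3": 0.40, "m4": 0.30}
other_tasks = {"m1": 0.70, "m2": 0.65, "m3": 0.50, "m4": 0.55}

if __name__ == "__main__":
    print(f"tau = {kendall_tau(new_tasks, other_tasks):.3f}")
```

A tau near 1 means the two subsets rank the models almost identically; a value near 0 or below would be the kind of sharp disagreement described above.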

Original abstract

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish comprising 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. The authors add 12 new Polish-language tasks to MTEB based on existing datasets, prepare two new datasets yielding four clustering tasks, evaluate 30 publicly available embedding models (Polish and multilingual), analyze results by task type and model size, and release the datasets, evaluation code, and results publicly.

Significance. If the added tasks and new datasets are valid and representative, PL-MTEB will serve as a useful standardized resource for Polish text embedding evaluation, filling a gap in multilingual benchmarks. The public release of datasets, code, and results supports reproducibility and community use, which is a clear strength of the work.

minor comments (1)
  1. [Abstract] The abstract states that two new datasets were prepared for four clustering tasks but provides no information on inter-annotator agreement or how task difficulty was balanced.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of PL-MTEB, accurate summary of the contributions, and recommendation to accept. We are glad the work is viewed as a useful standardized resource for Polish text embedding evaluation.

Circularity Check

0 steps flagged

No significant circularity


The manuscript introduces and releases an empirical benchmark (PL-MTEB) consisting of 30 tasks drawn from existing Polish datasets plus two newly created datasets. It contains no derivations, equations, fitted parameters, or predictions that could reduce to their own inputs. All claims are statements of dataset construction, public release, and model evaluation results; these are externally verifiable and do not rely on self-referential definitions or self-citation chains for their validity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that existing English MTEB task templates can be faithfully translated and annotated for Polish without introducing systematic bias, plus the usual NLP premise that human-labeled datasets constitute valid ground truth.

axioms (1)
  • domain assumption Existing English MTEB task definitions transfer to Polish with only language-specific data substitution.
    Invoked when the authors state they added 12 new Polish-language tasks based on existing datasets.

pith-pipeline@v0.9.0 · 5675 in / 1260 out tokens · 17236 ms · 2026-05-24T00:41:08.260465+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

    cs.CL 2026-05 unverdicted novelty 5.0

    ML-Embed releases open multilingual embedding models trained with a new 3D-ML framework that reportedly set new MTEB records on 9 of 17 benchmarks, especially in low-resource languages.