PL-MTEB: Polish Massive Text Embedding Benchmark
Pith reviewed 2026-05-24 00:41 UTC · model grok-4.3
The pith
PL-MTEB supplies 30 Polish-language tasks across five categories to evaluate text embedding models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PL-MTEB is a benchmark of 30 diverse NLP tasks in Polish. It was built by adding 12 tasks derived from existing datasets to MTEB and by preparing two new datasets that yield four clustering tasks, and it allows direct comparison of 30 embedding models on Polish data.
What carries the argument
The PL-MTEB benchmark, which standardizes evaluation across five task categories and supplies the added Polish datasets and tasks.
If this is right
- Model rankings on Polish data can now be compared directly to rankings on the original MTEB tasks.
- Performance differences between Polish-only and multilingual models become measurable on Polish-specific tasks.
- Task-type and model-size breakdowns identify which embedding approaches work best for particular Polish use cases.
- Public datasets and code enable other groups to add further Polish tasks or rerun evaluations on new models.
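The task-type breakdown mentioned above can be illustrated with a minimal sketch. It assumes per-task scores are already available as simple tuples; the model names, task names, and numbers below are hypothetical placeholders, not results from the paper, and the helper functions are illustrative rather than part of the PL-MTEB code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (model, task category, task name, score).
# Every name and number here is a placeholder, not a PL-MTEB result.
scores = [
    ("model-a", "classification", "task-1", 0.60),
    ("model-a", "classification", "task-2", 0.70),
    ("model-a", "clustering", "task-3", 0.30),
    ("model-b", "classification", "task-1", 0.50),
    ("model-b", "classification", "task-2", 0.60),
    ("model-b", "clustering", "task-3", 0.40),
]


def category_averages(rows):
    """Return {model: {category: mean score over that category's tasks}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for model, category, _task, score in rows:
        buckets[model][category].append(score)
    return {
        model: {cat: mean(vals) for cat, vals in cats.items()}
        for model, cats in buckets.items()
    }


def best_model_per_category(avgs):
    """Return {category: model with the highest average in that category}."""
    categories = {cat for cats in avgs.values() for cat in cats}
    return {
        cat: max(avgs, key=lambda m: avgs[m].get(cat, float("-inf")))
        for cat in categories
    }


averages = category_averages(scores)
best = best_model_per_category(averages)
```

Averaging within a category before comparing models keeps a category with many tasks from dominating the overall picture, which is one plausible way to read a per-task-type analysis.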
Where Pith is reading between the lines
- The same method of adding language-specific tasks could be applied to create comparable benchmarks for other lower-resource languages.
- If Polish-only models outperform multilingual ones on certain task categories, that pattern may guide choices for other Slavic languages.
- The four new clustering tasks could be used to test whether embedding models preserve topic structure in Polish news or social media.
Load-bearing premise
The 12 added tasks and two new datasets reflect typical Polish embedding use cases without annotation artifacts that would change model rankings.
What would settle it
Re-running the 30 models on the 12 added tasks and the four new clustering tasks, and obtaining model orderings that differ sharply from the orderings on the remaining, previously existing tasks.
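One way to quantify how sharply two model orderings differ is a rank correlation. The sketch below uses Kendall's tau (the simple tau-a variant, without tie correction); this choice of statistic is an assumption made here for illustration, not a method named by the paper or the review, and all model names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a between two {model: score} dicts (no tie correction).

    Returns a value in [-1, 1]: 1 means identical orderings,
    -1 means fully reversed orderings.
    """
    models = sorted(set(scores_x) & set(scores_y))
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        product = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical average scores per model on two task subsets (placeholders).
existing_tasks = {"m1": 0.70, "m2": 0.65, "m3": 0.50, "m4": 0.40}
added_tasks = {"m1": 0.60, "m2": 0.62, "m3": 0.45, "m4": 0.30}

tau = kendall_tau(existing_tasks, added_tasks)
```

In this toy input only one pair of models (m1, m2) swaps places, so the orderings agree closely; a value near zero or below would indicate the kind of sharp disagreement described above.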
Original abstract
In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish comprising 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. The authors add 12 new Polish-language tasks to MTEB based on existing datasets, prepare two new datasets yielding four clustering tasks, evaluate 30 publicly available embedding models (Polish and multilingual), analyze results by task type and model size, and release the datasets, evaluation code, and results publicly.
Significance. If the added tasks and new datasets are valid and representative, PL-MTEB will serve as a useful standardized resource for Polish text embedding evaluation, filling a gap in multilingual benchmarks. The public release of datasets, code, and results supports reproducibility and community use, which is a clear strength of the work.
minor comments (1)
- [Abstract] The abstract states that two new datasets were prepared for four clustering tasks but provides no information on inter-annotator agreement or how task difficulty was balanced.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of PL-MTEB, accurate summary of the contributions, and recommendation to accept. We are glad the work is viewed as a useful standardized resource for Polish text embedding evaluation.
Circularity Check
No significant circularity
The manuscript introduces and releases an empirical benchmark (PL-MTEB) consisting of 30 tasks drawn from existing Polish datasets plus two newly created datasets. It contains no derivations, equations, fitted parameters, or predictions that could reduce to their own inputs. All claims are statements of dataset construction, public release, and model evaluation results; these are externally verifiable and do not rely on self-referential definitions or self-citation chains for their validity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing English MTEB task definitions transfer to Polish with only language-specific data substitution.
Forward citations
Cited by 1 Pith paper
- ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World
  ML-Embed releases open multilingual embedding models trained with a new 3D-ML framework that reportedly set new MTEB records on 9 of 17 benchmarks, especially in low-resource languages.