arxiv: 2604.12744 · v1 · submitted 2026-04-14 · 💻 cs.CL

Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Pith reviewed 2026-05-10 15:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords named entity recognition · multilingual benchmarks · annotation guidelines · cross-lingual evaluation · Universal NER · dataset creation · NLP benchmarks

The pith

Universal NER v2 expands gold-standard named entity annotations across more languages with a shared tagset and guidelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the second version of the Universal NER benchmark as an ongoing effort to create standardized named entity recognition datasets in many languages. It relies on one general tagset and detailed annotation guidelines so that entity spans receive consistent labels no matter the language. This release builds directly on the 2024 v1 dataset by adding new languages and community contributors. A reader would care because multilingual language models are widely used, yet most languages still lack reliable test data to measure how well those models handle named entities. If the standardized approach succeeds, the field gains a growing common reference set for comparing systems across languages.

Core claim

The paper claims that a community-coordinated project using a single tagset and thorough guidelines can produce reliable, cross-lingual named entity annotations at scale, and it reports the current state of this effort through the Universal NER v2 release that extends the initial 2024 dataset.

What carries the argument

The general tagset and annotation guidelines that define consistent rules for identifying and labeling named entity spans across languages.
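
To make the idea of a shared tagset concrete, here is a minimal sketch of how per-token tags can be decoded into labeled entity spans in the same way for every language. The BIO encoding and the label names PER, ORG and LOC are assumptions made for illustration; this page does not enumerate the tagset, and `bio_to_spans` is a hypothetical helper, not part of any released UNER tooling.

```python
from typing import List, Optional, Tuple

# Assumed label inventory, for illustration only (not taken from this page).
TAGSET = {"PER", "ORG", "LOC"}

# (start token index, end token index exclusive, label)
Span = Tuple[int, int, str]

def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence such as ["B-PER", "I-PER", "O"] into spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # "B-" always opens a span; a stray "I-" is leniently treated as "B-".
        if prefix == "B" or (prefix == "I" and start is None):
            if name not in TAGSET:
                raise ValueError(f"label outside the shared tagset: {name!r}")
            start, label = i, name
    return spans

example = bio_to_spans(["B-PER", "I-PER", "O", "B-LOC"])
```

Because the decoder only depends on the tag strings, the same function applies unchanged to any language whose annotations follow the shared scheme, which is the property the benchmark relies on.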

If this is right

  • Researchers gain gold-standard NER evaluation data for additional languages beyond the v1 set.
  • Direct comparisons of model performance become possible on a wider range of languages using the same label definitions.
  • The benchmark can be used to test whether multilingual models maintain entity recognition quality outside high-resource languages.
  • The project shows a practical path for expanding the number of languages covered without requiring a single central annotation team.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same community annotation model could be adapted to create benchmarks for related tasks such as coreference resolution or entity linking.
  • If the datasets continue to grow, they might reveal systematic gaps in how current models handle entities in particular language families.
  • Widespread adoption would let developers train single models that perform entity recognition across dozens or hundreds of languages at once.

Load-bearing premise

Annotations produced by a distributed community of organizers and annotators across languages will achieve consistent quality and adherence to the guidelines.

What would settle it

Independent review of a sample of annotations from several languages revealing large differences in which text spans receive entity labels or which categories are assigned.
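
One way such a review could be quantified is sketched below: exact-match F1 between two annotators' span sets, computed once on boundaries alone and once on boundaries plus label, so that disagreement about which spans are marked can be separated from disagreement about which category is assigned. The span layout, the function name `agreement_f1`, and the toy annotations are all hypothetical; the paper's own agreement procedure is not described on this page.

```python
from typing import Set, Tuple

# (sentence id, start token, end token exclusive, label); hypothetical layout
Span = Tuple[int, int, int, str]

def agreement_f1(a: Set[Span], b: Set[Span], labeled: bool = True) -> float:
    """Exact-match F1 between two annotators' span sets (symmetric in a and b)."""
    if not labeled:
        # Drop the label so that only span boundaries are compared.
        a = {s[:3] for s in a}
        b = {s[:3] for s in b}
    if not a and not b:
        return 1.0  # neither annotator marked anything: full agreement
    # With P = m/|b| and R = m/|a|, F1 = 2PR/(P+R) simplifies to 2m/(|a|+|b|).
    return 2 * len(a & b) / (len(a) + len(b))

# Toy annotations, invented for illustration only.
annotator_1 = {(0, 0, 2, "PER"), (0, 3, 4, "LOC"), (1, 1, 2, "ORG")}
annotator_2 = {(0, 0, 2, "PER"), (0, 3, 4, "ORG"), (1, 1, 3, "ORG")}

boundary_f1 = agreement_f1(annotator_1, annotator_2, labeled=False)
labeled_f1 = agreement_f1(annotator_1, annotator_2)
```

A large gap between the two scores would point to category confusion, while a low boundary score would point to disagreement over what counts as an entity mention in the first place.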

Figures

Figures reproduced from arXiv: 2604.12744 by Eugene Jang, Eungseo Kim, Hila Gonen, Jeongyeon Seo, Jun Kevin, Kaja Dobrovoljc, Marek Šuppa, Shachar Mirkin, Stephen Mayhew, Terra Blevins, Vasile Pais, Voula Giouli, Xenophon Gialis, Yuval Pinter.

Figure 1. Comparison of the dataset statistics for …
Figure 2. A cross-lingual comparison of UNER annotations on top of parallel text (PUD). We consider the …
Figure 3. Experimental results for in-language and cross-lingual UNER performance with XLM-R
Figure 4. F1 score comparison across 19 multilingual NER datasets (test sets) for three LLMs against human Inter-Annotator Agreement (IAA) baseline. Bold values indicate the best-performing model for each dataset; underlined values indicate scores exceeding human agreement.
Original abstract

While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript describes the Universal NER v2 project, which expands on v1 to build gold-standard multilingual Named Entity Recognition benchmark datasets. It employs a general tagset and detailed annotation guidelines to produce standardized, cross-lingual annotations of named entity spans through a distributed community of organizers, annotators, and collaborators.

Significance. If the annotations achieve verifiable high consistency and quality across languages, the resource would address a critical gap in evaluation benchmarks for multilingual LLMs, enabling more rigorous cross-lingual comparisons similar to the impact of Universal Dependencies.

major comments (2)
  1. [Abstract] The central claim that the collected datasets constitute 'gold-standard' multilingual NER benchmarks is unsupported, as the text supplies no quantitative evidence on annotation quality, inter-annotator agreement scores, coverage statistics, or validation/adjudication procedures. This directly undermines the reliability of the standardization claim.
  2. [Abstract] The description of the annotation workflow (distributed organizers and annotators using shared guidelines) lacks any mechanism for ensuring cross-lingual consistency or resolving disagreements, leaving the weakest assumption about community-driven quality unaddressed.
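
As an illustration of the kind of inter-annotator agreement score the first comment asks for, the sketch below computes Cohen's kappa over two annotators' token-level labels, which corrects raw agreement for agreement expected by chance. The choice of kappa, the function name `cohen_kappa`, and the toy label sequences are assumptions for illustration; the page does not say which statistic the authors report.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(x: Sequence[str], y: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(x)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(a == b for a, b in zip(x, y)) / n
    # Chance agreement from each annotator's own label distribution.
    count_x, count_y = Counter(x), Counter(y)
    p_e = sum(count_x[label] * count_y[label] for label in count_x) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy token labels, invented for illustration only.
labels_1 = ["O", "PER", "O", "LOC"]
labels_2 = ["O", "PER", "O", "O"]
kappa = cohen_kappa(labels_1, labels_2)
```

Token-level kappa is dominated by the frequent "O" label in NER data, so in practice it is usually reported alongside a span-level measure rather than instead of one.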

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We address the two major comments point by point below, outlining specific revisions that will strengthen the manuscript's support for the gold-standard claim and the annotation workflow description.

  1. Referee: [Abstract] The central claim that the collected datasets constitute 'gold-standard' multilingual NER benchmarks is unsupported, as the text supplies no quantitative evidence on annotation quality, inter-annotator agreement scores, coverage statistics, or validation/adjudication procedures. This directly undermines the reliability of the standardization claim.

    Authors: We agree that the abstract, being a high-level summary, does not include quantitative metrics, and the manuscript text as a whole provides limited numerical evidence on these points. The 'gold-standard' designation in the paper refers primarily to the use of human annotators following detailed, shared guidelines rather than automatic or silver-standard methods, consistent with projects like Universal Dependencies. To directly address the concern, we will revise the abstract to reference available quality indicators and add a dedicated subsection (or expand an existing methods section) with inter-annotator agreement scores, coverage statistics, and validation procedures drawn from the v1 release and ongoing v2 collection where they exist. revision: yes

  2. Referee: [Abstract] The description of the annotation workflow (distributed organizers and annotators using shared guidelines) lacks any mechanism for ensuring cross-lingual consistency or resolving disagreements, leaving the weakest assumption about community-driven quality unaddressed.

    Authors: The abstract provides only a brief overview of the distributed community model. The full manuscript describes the roles of organizers and the use of shared guidelines, but we acknowledge it does not explicitly detail cross-lingual consistency mechanisms or disagreement resolution. In the revision, we will expand the workflow description to include how organizers coordinate across languages (e.g., through regular alignment discussions and guideline updates), how language-specific teams apply the guidelines uniformly, and the procedures for resolving annotator disagreements via adjudication. revision: yes

Circularity Check

0 steps flagged

No significant circularity: data collection and benchmark release paper


The paper describes the Universal NER v2 project as a community-driven effort to collect and release multilingual NER annotation datasets using shared guidelines and a general tagset. No mathematical derivations, equations, predictions, fitted parameters, or first-principles results are present in the abstract or described structure. The central claims concern data standardization and expansion from v1, justified by reference to prior efforts like Universal Dependencies rather than any self-referential reduction. No load-bearing steps reduce to inputs by construction, self-citation chains, or ansatz smuggling. This is a standard dataset release paper whose quality assumptions (e.g., annotation consistency) are external to any internal derivation and can be evaluated against reported IAA or adjudication details if provided.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, derivations, or empirical claims requiring parameters or axioms; the paper is a resource description with no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5470 in / 979 out tokens · 51140 ms · 2026-05-10T15:14:11.612509+00:00 · methodology
