BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition
Pith reviewed 2026-05-13 19:57 UTC · model grok-4.3
The pith
A gold-standard benchmark dataset for biomedical named entity recognition in Urdu has been assembled, comprising 153,000 annotated tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors created the BioUNER dataset as a gold-standard benchmark for biomedical named entity recognition in Urdu by crawling health-related articles from online news portals, prescriptions, and hospital websites; having three domain-familiar native speakers annotate 153K tokens in Doccano; achieving 0.78 inter-annotator agreement; and demonstrating utility through evaluations of SVM, LSTM, mBERT, and XLM-RoBERTa models.
What carries the argument
The BioUNER dataset of 153K annotated Urdu tokens for biomedical entities, sourced from diverse health-related texts and labeled through a three-annotator Doccano process validated by inter-annotator agreement.
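The paper does not describe how Doccano's span annotations become token-level training data, but the standard route is a character-offset-to-BIO conversion. A minimal sketch, assuming Doccano's usual JSONL export schema and plain whitespace tokenization (a rough simplification for Urdu script); the filename is hypothetical:

```python
# Minimal sketch: convert one record of a Doccano sequence-labeling
# export (JSONL with character-offset spans) into token-level BIO tags.
# Assumptions not stated in the paper: the export schema below and
# whitespace tokenization, which is a rough simplification for Urdu.
import json

def record_to_bio(record: dict) -> list[tuple[str, str]]:
    text = record["text"]
    spans = record.get("label", [])  # e.g. [[start, end, "DISEASE"], ...]
    # Recover character offsets for whitespace-delimited tokens.
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for s, e, etype in spans:
        inside = False
        for i, (ts, te) in enumerate(offsets):
            if ts >= s and te <= e:  # token falls fully inside the span
                tags[i] = ("I-" if inside else "B-") + etype
                inside = True
    return list(zip(tokens, tags))

with open("bio_uner.jsonl", encoding="utf-8") as fh:  # hypothetical file
    for line in fh:
        for token, tag in record_to_bio(json.loads(line)):
            print(token, tag)
```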
Load-bearing premise
The crawled Urdu health texts from news, prescriptions, and blogs represent typical clinical language, and the three-annotator process produces reliable gold-standard labels.
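The abstract reports a single agreement figure of 0.78 without naming the coefficient. One plausible reading, sketched below, is the mean of pairwise Cohen's kappa over token-aligned labels from the three annotators; Fleiss' kappa would be the other common choice. This is an assumption, not the authors' stated method:

```python
# Sketch: average pairwise Cohen's kappa for three annotators over
# aligned token-level BIO tags. The toy sequences are illustrative,
# not drawn from BioUNER.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(annotations: list[list[str]]) -> float:
    """annotations: one tag sequence per annotator, aligned token-wise."""
    scores = [cohen_kappa_score(a, b) for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)

a1 = ["O", "B-DISEASE", "I-DISEASE", "O", "B-DRUG"]
a2 = ["O", "B-DISEASE", "I-DISEASE", "O", "O"]
a3 = ["O", "B-DISEASE", "O", "O", "B-DRUG"]
print(round(mean_pairwise_kappa([a1, a2, a3]), 2))
```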
What would settle it
A follow-up study that applies the same annotation guidelines to actual hospital patient records and finds substantially lower model performance or inter-annotator agreement than reported on BioUNER would indicate the crawled sources do not match real clinical usage.
Original abstract
In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators with familiarity in the medical domain participated in the annotation process using the Doccano text annotation tool and annotated 153K tokens. Following annotation, the proposed BioiUNER [sic] dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BioUNER, a benchmark dataset for biomedical named entity recognition in Urdu. It describes crawling health-related Urdu texts from news portals, prescriptions, and hospital blogs, followed by annotation of 153K tokens by three domain-familiar native annotators using the Doccano tool. The authors report an inter-annotator agreement of 0.78 and claim gold-standard quality. They further evaluate the dataset extrinsically by training and testing SVM, LSTM, mBERT, and XLM-RoBERTa models.
Significance. A properly documented and released Urdu clinical NER benchmark would address a clear resource gap in low-resource biomedical NLP. The scale (153K tokens) and reported IAA are promising, but the absence of entity definitions, guidelines, and release information prevents the work from currently functioning as a verifiable benchmark.
major comments (4)
- [Abstract] The entity types and annotation schema are never defined. Without an explicit inventory of clinical entities (e.g., disease, drug, symptom) and their scope, the IAA score of 0.78 cannot be interpreted as evidence of gold-standard quality.
- [Annotation Process] No description is given of the annotation guidelines provided to the three annotators, the adjudication procedure for disagreements, or any quality-control steps beyond the single IAA figure. These omissions are load-bearing for the central gold-standard claim.
- [Dataset Release] The manuscript does not state whether the labeled BioUNER data will be released publicly (with or without license), which is a prerequisite for any dataset to serve as a community benchmark.
- [Evaluation] Model results are mentioned but no numerical performance figures, baseline comparisons, or error analysis are supplied, making it impossible to assess whether the dataset actually supports reproducible benchmarking.
minor comments (1)
- [Abstract] Abstract contains the typo 'BioiUNER' instead of 'BioUNER'.
Simulated Author's Rebuttal
We thank the referee for the constructive report. The comments correctly identify several omissions that prevent the current manuscript from fully functioning as a verifiable benchmark. We will revise the paper to address each point and strengthen the documentation of the dataset.
Point-by-point responses
Referee: [Abstract] The entity types and annotation schema are never defined. Without an explicit inventory of clinical entities (e.g., disease, drug, symptom) and their scope, the IAA score of 0.78 cannot be interpreted as evidence of gold-standard quality.
Authors: We agree. The revised manuscript will include a dedicated subsection that explicitly lists the clinical entity types (DISEASE, DRUG, SYMPTOM, PROCEDURE, ANATOMY, and others), provides concise definitions for each, and describes the annotation scope and boundary rules. This will allow readers to interpret the IAA figure meaningfully. revision: yes
Referee: [Annotation Process] No description is given of the annotation guidelines provided to the three annotators, the adjudication procedure for disagreements, or any quality-control steps beyond the single IAA figure. These omissions are load-bearing for the central gold-standard claim.
Authors: We accept the criticism. The revised annotation section will describe the written guidelines given to annotators, the pilot annotation round used for training, the adjudication process (discussion followed by majority vote for persistent disagreements), and additional quality-control steps such as periodic consistency checks and review of low-agreement documents. revision: yes
Referee: [Dataset Release] The manuscript does not state whether the labeled BioUNER data will be released publicly (with or without license), which is a prerequisite for any dataset to serve as a community benchmark.
Authors: We will add an explicit statement that the full annotated BioUNER dataset will be released publicly under a CC-BY 4.0 license upon acceptance, together with the annotation guidelines and a data card. A permanent repository link will be included in the camera-ready version. revision: yes
Referee: [Evaluation] Model results are mentioned but no numerical performance figures, baseline comparisons, or error analysis are supplied, making it impossible to assess whether the dataset actually supports reproducible benchmarking.
Authors: The current draft only summarizes the models evaluated. In the revision we will insert a full results table with precision, recall, and F1 scores for SVM, LSTM, mBERT, and XLM-RoBERTa, include standard baselines (e.g., CRF), and add a short error analysis highlighting the most frequent confusion types. revision: yes
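For concreteness, the entity-level scoring such a results table would rest on is typically computed with seqeval, which scores whole entity spans rather than individual tokens. A minimal sketch on toy predictions (not the paper's results):

```python
# Sketch of standard entity-level NER scoring with seqeval: a span
# counts as correct only if both its boundaries and its type match.
# The sequences below are toy data for illustration.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-DISEASE", "I-DISEASE", "O", "B-DRUG", "O"]]
y_pred = [["B-DISEASE", "I-DISEASE", "O", "O", "O"]]

print(classification_report(y_true, y_pred))
print("micro F1:", f1_score(y_true, y_pred))  # seqeval defaults to micro
```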
Circularity Check
No circularity: dataset creation relies on external crawling and independent human annotation
Full rationale
The paper's central process—crawling health-related Urdu sources, preprocessing, annotating 153K tokens via Doccano with three domain-familiar annotators, computing IAA=0.78, and evaluating off-the-shelf models (SVM, LSTM, mBERT, XLM-RoBERTa)—contains no derivations, equations, fitted parameters, or self-citations that reduce to the inputs by construction. The benchmark claim rests on external data collection and human labeling rather than any self-definitional loop or renamed known result. Missing annotation guidelines or adjudication details affect verifiability but do not create circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Named entity recognition is a well-defined sequence labeling task in NLP.
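To make the axiom concrete: under a BIO scheme, each entity type contributes a B- and an I- label, and NER reduces to assigning one label per token. The sketch below uses the entity inventory promised in the rebuttal; the tagged sentence is a constructed toy, not dataset text:

```python
# Illustration of the axiom: NER as per-token sequence labeling under
# a BIO scheme. Entity types follow the inventory the rebuttal promises
# (DISEASE, DRUG, SYMPTOM, PROCEDURE, ANATOMY); the example sentence is
# hypothetical.
ENTITY_TYPES = ["DISEASE", "DRUG", "SYMPTOM", "PROCEDURE", "ANATOMY"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(LABELS)  # 11 labels: O plus B-/I- for each entity type

tokens = ["The", "patient", "was", "given", "paracetamol", "for", "fever", "."]
tags   = ["O",   "O",       "O",   "O",     "B-DRUG",      "O",   "B-SYMPTOM", "O"]
assert len(tokens) == len(tags) and set(tags) <= set(LABELS)
```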