arxiv: 2604.25374 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

Language corpora for the Dutch medical domain

B. van Es

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Dutch medical corpuslanguage resourcesNLP pre-trainingmedical text extractiontranslated datasetsDutch healthcare NLPopen data corpus

0 comments

The pith

A 35-billion-token Dutch medical corpus has been assembled from translations and extractions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to overcome the scarcity of Dutch medical text data that has limited natural language processing in healthcare. It does so by translating English medical datasets, spotting medical content inside larger Dutch collections, and pulling together open Dutch medical resources. The result is a single large corpus that researchers can use for training language models and for tasks like information extraction. Without such a resource, Dutch medical NLP has had to rely on smaller or English-only data, slowing progress on language-specific tools. If the corpus meets its quality goals, it supplies a ready foundation that can be applied immediately to model development and clinical applications.

Core claim

By translating English medical datasets, identifying medical text within generic Dutch corpora, and extracting open Dutch medical resources, the work produces a corpus of roughly 35 billion tokens spread across about 100 million documents. This collection is presented as the first large-scale Dutch medical language resource suitable for pre-training language models and for downstream NLP tasks, and it is released freely on Hugging Face.

What carries the argument

The three-step corpus construction process of translation of English medical data, medical-text identification in broader Dutch sources, and direct extraction of open Dutch medical documents.

If this is right

The corpus can serve as pre-training data for Dutch medical language models.
It supports downstream tasks such as medical information extraction or question answering in Dutch.
Researchers gain a public starting point instead of building medical data collections from scratch.
The described methods offer a repeatable template for expanding the corpus with new sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The resource could enable experiments that compare Dutch medical language patterns directly against English ones using the translated components.
Similar construction pipelines might be tested on other low-resource languages to test whether the same scale can be reached without original large medical collections.
Subsets of the corpus could be annotated for supervised tasks, turning the raw text into training data for classification or entity recognition models.

Load-bearing premise

The translations, medical-text identifications, and extractions produce a high-quality, representative corpus without major errors, biases, or irrelevant content that would undermine its utility for NLP.

What would settle it

An evaluation showing that language models trained on this corpus achieve no improvement over models trained on generic Dutch text or smaller medical samples when tested on standard Dutch medical NLP benchmarks.

read the original abstract

\textbf{Background:} Dutch medical corpora are scarce, limiting NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises $\pm$ 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. \\ \textbf{Conclusion:} This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes the creation of a large-scale Dutch medical language corpus through translation of English medical datasets, identification of medical text in generic Dutch corpora, and extraction of open Dutch medical resources. The resulting resource contains approximately 35 billion tokens in about 100 million documents and is released freely on Hugging Face for pre-training and downstream NLP tasks.

Significance. If the corpus proves to be high-quality and representative, this would be a valuable contribution by filling a gap in Dutch medical NLP resources, enabling better model development in a low-resource language-domain combination. The scale is notable, but the absence of quality metrics limits the assessed impact.

major comments (2)

[Methods] Methods: The pipeline description (translation of English datasets, heuristic/model-based medical text identification in generic corpora, and open resource extraction) reports no precision/recall figures, inter-annotator agreement, or human evaluation of translation accuracy for medical terminology. This directly undermines the central claim that the corpus is genuinely medical-domain and suitable for pre-training.
[Results] Results: The final corpus size of ±35 billion tokens is stated without accompanying details on filtering criteria, deduplication steps, or measured error rates from the identification and translation stages, leaving the representativeness unsupported.

minor comments (2)

[Abstract] The abstract's use of '±' for the token count is informal; replace with 'approximately' for consistency with academic style.
[Conclusion] The conclusion asserts this is the 'first' large-scale Dutch medical corpus; a short comparison to any prior smaller Dutch medical datasets would strengthen this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the Dutch medical language corpus. The comments correctly identify areas where additional methodological detail and validation would strengthen the presentation. We address each major comment below and commit to revisions that provide the requested information without overstating what was performed.

read point-by-point responses

Referee: [Methods] Methods: The pipeline description (translation of English datasets, heuristic/model-based medical text identification in generic corpora, and open resource extraction) reports no precision/recall figures, inter-annotator agreement, or human evaluation of translation accuracy for medical terminology. This directly undermines the central claim that the corpus is genuinely medical-domain and suitable for pre-training.

Authors: We agree that the absence of quantitative validation metrics leaves the medical-domain claim less substantiated than it could be. The manuscript describes the pipeline at a high level because the primary contribution is the public release of the assembled corpus rather than a new extraction method. Translations were performed on established English medical datasets using professional services, identification combined keyword heuristics and off-the-shelf classifiers drawn from prior medical NLP literature, and extraction targeted explicitly medical Dutch sources. In the revision we will add a Validation subsection that reports (a) precision estimates obtained from manual review of random samples (approximately 1,000 documents per major source type), (b) inter-annotator agreement for those reviews, and (c) qualitative notes on translation accuracy for medical terminology based on expert spot-checks. We will also explicitly discuss the limitations of not conducting exhaustive evaluation at this scale. revision: yes
Referee: [Results] Results: The final corpus size of ±35 billion tokens is stated without accompanying details on filtering criteria, deduplication steps, or measured error rates from the identification and translation stages, leaving the representativeness unsupported.

Authors: The current manuscript states the final token count and document count but indeed omits granular post-processing details. We will revise the Methods and Results sections to specify the filtering criteria applied after initial collection (document length thresholds, language identification confidence scores, and removal of boilerplate), the deduplication procedure (including the algorithm and similarity threshold employed), and any available error-rate indicators from the identification and translation stages. These additions will allow readers to better assess the final corpus composition and representativeness within the Dutch medical domain. revision: yes

Circularity Check

0 steps flagged

No circularity in data compilation and release paper

full rationale

The manuscript describes the assembly of a Dutch medical corpus via translation of existing English datasets, heuristic/model-based identification of medical text within generic Dutch corpora, and extraction of open Dutch medical resources, resulting in a released 35B-token collection. No equations, predictions, fitted parameters, or derivation chains appear in the provided abstract or described methods. The central claim is the existence and availability of the corpus itself rather than any computed result that could reduce to its inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained as a direct data resource contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied resource-construction paper. No free parameters, mathematical axioms, or new postulated entities are introduced beyond routine NLP data-processing steps.

pith-pipeline@v0.9.0 · 5358 in / 1197 out tokens · 70021 ms · 2026-05-07T16:21:37.645803+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 4 canonical work pages · 1 internal anchor

[1]

p. 1–67. Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:210100027. 2020;. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for ...

2020
[2]

1703–1714

p. 1703–1714. Available from: https://aclanthology.org/2020.acl-main.156/. Ordelman R, de Jong F, Van Hessen A, Hondorp H. TWnC: a multifaceted Dutch news corpus. ELRA Newsletter. 2007;12(3/4):4–7. 8 Oostdijk N, Reynaert M, Monachesi P, Van Noord G, Ordelman R, Schuurman I, et al. From D-Coi to SoNaR: a reference corpus for Dutch. In: Calzolari N, Choukri...

2020
[3]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Available from: https://aclanthology.org/L08-1226/. Penedo G, Kydl´ ıˇ cek H, allal LB, Lozhkov A, Mitchell M, Raffel C, et al.: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Available from: https://arxiv.org/abs/2406.17557. Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based Language Model. In: Proceedings of th...

work page internal anchor Pith review arXiv 2020
[4]

3255–3265

p. 3255–3265. Menger V, Scheepers F, van Wijk LM, Spruit M. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telematics and Infor- matics. 2018;35(4):727 –

2018
[5]

https://doi.org/https://doi.org/10.1016/j.tele.2017. 08.002. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association. 2017;24(3):596–606. Team N, Costa-juss` a MR, Cross J, C ¸ elebi O, Elbayad M, Heafield K, et al.: No Lan- guage Left Behind: ...

work page doi:10.1016/j.tele.2017 2017
[6]

p. 479–480. Available from: https://aclanthology.org/ 2020.eamt-1.61. Wang X, Chen N, Chen J, Wang Y, Zhen G, Zhang C, et al.: Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. Available from: https://arxiv.org/abs/2403.03640. Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC- III, a f...

work page arXiv 2020
[7]

p. 265–272. Available from: https://aclanthology.org/2023.bionlp-1.23. Remy F, Demuynck K, Demeester T. BioLORD: Learning Ontological Representa- tions from Definitions for Biomedical Concepts and their Textual Descriptions. In: Findings of the Association for Computational Linguistics: EMNLP 2022

2023
[8]

MEDITRON-70B: Scaling medical pretraining for large language models

p. 1454–1465. Remy F, Demuynck K, Demeester T. BioLORD-2023: Semantic textual repre- sentations fusing llm and clinical knowledge graph insights. arXiv preprint arXiv:231116075. 2023;. 10 Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, et al.: MEDITRON- 70B: Scaling Medical Pretraining for Large Language Models. Available from: https: //arxiv.org...

work page arXiv 2023
[9]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams

Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences. 2021;11(14):6421. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring Mas- sive Multitask Language Understanding. In: International Conference on Lea...

2021
[10]

MedAlpaca–an open-source collection of medical conversational AI models and training data

Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Figueroa A, et al. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:230408247. 2023;. Krithara A, Nentidis A, Bougiatiotis K, Paliouras G. BioASQ-QA: A manually curated corpus for Biomedical Question Answering. Scientific Data. 2023;1...

2023