Recognition: unknown
Language corpora for the Dutch medical domain
Pith reviewed 2026-05-07 16:21 UTC · model grok-4.3
The pith
A 35-billion-token Dutch medical corpus has been assembled from translations and extractions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By translating English medical datasets, identifying medical text within generic Dutch corpora, and extracting open Dutch medical resources, the work produces a corpus of roughly 35 billion tokens spread across about 100 million documents. This collection is presented as the first large-scale Dutch medical language resource suitable for pre-training language models and for downstream NLP tasks, and it is released freely on Hugging Face.
What carries the argument
The three-step corpus construction process of translation of English medical data, medical-text identification in broader Dutch sources, and direct extraction of open Dutch medical documents.
If this is right
- The corpus can serve as pre-training data for Dutch medical language models.
- It supports downstream tasks such as medical information extraction or question answering in Dutch.
- Researchers gain a public starting point instead of building medical data collections from scratch.
- The described methods offer a repeatable template for expanding the corpus with new sources.
Where Pith is reading between the lines
- The resource could enable experiments that compare Dutch medical language patterns directly against English ones using the translated components.
- Similar construction pipelines might be tested on other low-resource languages to test whether the same scale can be reached without original large medical collections.
- Subsets of the corpus could be annotated for supervised tasks, turning the raw text into training data for classification or entity recognition models.
Load-bearing premise
The translations, medical-text identifications, and extractions produce a high-quality, representative corpus without major errors, biases, or irrelevant content that would undermine its utility for NLP.
What would settle it
An evaluation showing that language models trained on this corpus achieve no improvement over models trained on generic Dutch text or smaller medical samples when tested on standard Dutch medical NLP benchmarks.
read the original abstract
\textbf{Background:} Dutch medical corpora are scarce, limiting NLP development. \\ \textbf{Methods:} We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. \\ \textbf{Results:} The resulting corpus comprises $\pm$ 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. \\ \textbf{Conclusion:} This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the creation of a large-scale Dutch medical language corpus through translation of English medical datasets, identification of medical text in generic Dutch corpora, and extraction of open Dutch medical resources. The resulting resource contains approximately 35 billion tokens in about 100 million documents and is released freely on Hugging Face for pre-training and downstream NLP tasks.
Significance. If the corpus proves to be high-quality and representative, this would be a valuable contribution by filling a gap in Dutch medical NLP resources, enabling better model development in a low-resource language-domain combination. The scale is notable, but the absence of quality metrics limits the assessed impact.
major comments (2)
- [Methods] Methods: The pipeline description (translation of English datasets, heuristic/model-based medical text identification in generic corpora, and open resource extraction) reports no precision/recall figures, inter-annotator agreement, or human evaluation of translation accuracy for medical terminology. This directly undermines the central claim that the corpus is genuinely medical-domain and suitable for pre-training.
- [Results] Results: The final corpus size of ±35 billion tokens is stated without accompanying details on filtering criteria, deduplication steps, or measured error rates from the identification and translation stages, leaving the representativeness unsupported.
minor comments (2)
- [Abstract] The abstract's use of '±' for the token count is informal; replace with 'approximately' for consistency with academic style.
- [Conclusion] The conclusion asserts this is the 'first' large-scale Dutch medical corpus; a short comparison to any prior smaller Dutch medical datasets would strengthen this claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript describing the Dutch medical language corpus. The comments correctly identify areas where additional methodological detail and validation would strengthen the presentation. We address each major comment below and commit to revisions that provide the requested information without overstating what was performed.
read point-by-point responses
-
Referee: [Methods] Methods: The pipeline description (translation of English datasets, heuristic/model-based medical text identification in generic corpora, and open resource extraction) reports no precision/recall figures, inter-annotator agreement, or human evaluation of translation accuracy for medical terminology. This directly undermines the central claim that the corpus is genuinely medical-domain and suitable for pre-training.
Authors: We agree that the absence of quantitative validation metrics leaves the medical-domain claim less substantiated than it could be. The manuscript describes the pipeline at a high level because the primary contribution is the public release of the assembled corpus rather than a new extraction method. Translations were performed on established English medical datasets using professional services, identification combined keyword heuristics and off-the-shelf classifiers drawn from prior medical NLP literature, and extraction targeted explicitly medical Dutch sources. In the revision we will add a Validation subsection that reports (a) precision estimates obtained from manual review of random samples (approximately 1,000 documents per major source type), (b) inter-annotator agreement for those reviews, and (c) qualitative notes on translation accuracy for medical terminology based on expert spot-checks. We will also explicitly discuss the limitations of not conducting exhaustive evaluation at this scale. revision: yes
-
Referee: [Results] Results: The final corpus size of ±35 billion tokens is stated without accompanying details on filtering criteria, deduplication steps, or measured error rates from the identification and translation stages, leaving the representativeness unsupported.
Authors: The current manuscript states the final token count and document count but indeed omits granular post-processing details. We will revise the Methods and Results sections to specify the filtering criteria applied after initial collection (document length thresholds, language identification confidence scores, and removal of boilerplate), the deduplication procedure (including the algorithm and similarity threshold employed), and any available error-rate indicators from the identification and translation stages. These additions will allow readers to better assess the final corpus composition and representativeness within the Dutch medical domain. revision: yes
Circularity Check
No circularity in data compilation and release paper
full rationale
The manuscript describes the assembly of a Dutch medical corpus via translation of existing English datasets, heuristic/model-based identification of medical text within generic Dutch corpora, and extraction of open Dutch medical resources, resulting in a released 35B-token collection. No equations, predictions, fitted parameters, or derivation chains appear in the provided abstract or described methods. The central claim is the existence and availability of the corpus itself rather than any computed result that could reduce to its inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained as a direct data resource contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
p. 1–67. Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:210100027. 2020;. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for ...
2020
-
[2]
1703–1714
p. 1703–1714. Available from: https://aclanthology.org/2020.acl-main.156/. Ordelman R, de Jong F, Van Hessen A, Hondorp H. TWnC: a multifaceted Dutch news corpus. ELRA Newsletter. 2007;12(3/4):4–7. 8 Oostdijk N, Reynaert M, Monachesi P, Van Noord G, Ordelman R, Schuurman I, et al. From D-Coi to SoNaR: a reference corpus for Dutch. In: Calzolari N, Choukri...
2020
-
[3]
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Available from: https://aclanthology.org/L08-1226/. Penedo G, Kydl´ ıˇ cek H, allal LB, Lozhkov A, Mitchell M, Raffel C, et al.: The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Available from: https://arxiv.org/abs/2406.17557. Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based Language Model. In: Proceedings of th...
work page internal anchor Pith review arXiv 2020
-
[4]
3255–3265
p. 3255–3265. Menger V, Scheepers F, van Wijk LM, Spruit M. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telematics and Infor- matics. 2018;35(4):727 –
2018
-
[5]
https://doi.org/https://doi.org/10.1016/j.tele.2017. 08.002. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association. 2017;24(3):596–606. Team N, Costa-juss` a MR, Cross J, C ¸ elebi O, Elbayad M, Heafield K, et al.: No Lan- guage Left Behind: ...
-
[6]
p. 479–480. Available from: https://aclanthology.org/ 2020.eamt-1.61. Wang X, Chen N, Chen J, Wang Y, Zhen G, Zhang C, et al.: Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. Available from: https://arxiv.org/abs/2403.03640. Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC- III, a f...
-
[7]
p. 265–272. Available from: https://aclanthology.org/2023.bionlp-1.23. Remy F, Demuynck K, Demeester T. BioLORD: Learning Ontological Representa- tions from Definitions for Biomedical Concepts and their Textual Descriptions. In: Findings of the Association for Computational Linguistics: EMNLP 2022
2023
-
[8]
MEDITRON-70B: Scaling medical pretraining for large language models
p. 1454–1465. Remy F, Demuynck K, Demeester T. BioLORD-2023: Semantic textual repre- sentations fusing llm and clinical knowledge graph insights. arXiv preprint arXiv:231116075. 2023;. 10 Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, et al.: MEDITRON- 70B: Scaling Medical Pretraining for Large Language Models. Available from: https: //arxiv.org...
-
[9]
What disease does this patient have? a large-scale open domain question answering dataset from medical exams
Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences. 2021;11(14):6421. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring Mas- sive Multitask Language Understanding. In: International Conference on Lea...
2021
-
[10]
MedAlpaca–an open-source collection of medical conversational AI models and training data
Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Figueroa A, et al. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:230408247. 2023;. Krithara A, Nentidis A, Bougiatiotis K, Paliouras G. BioASQ-QA: A manually curated corpus for Biomedical Question Answering. Scientific Data. 2023;1...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.