Generating High Quality Synthetic Data for Dutch Medical Conversations
Pith reviewed 2026-05-15 00:33 UTC · model grok-4.3
The pith
A pipeline using a fine-tuned Dutch LLM guided by real conversations can generate synthetic medical dialogues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that synthetic Dutch medical dialogues can be produced by prompting a Dutch fine-tuned LLM with real conversations as reference material for language and turn structure. Quantitative scores indicate high lexical diversity but excessively regular turn-taking, while qualitative ratings from native speakers and practitioners fall slightly below average due to shortfalls in natural expression and medical domain fit. The limited agreement between the two evaluation types leads to the conclusion that numerical metrics alone miss important quality aspects, and that careful prompting informed by domain expertise is required to reach usable balance between naturalness and structure.
What carries the argument
The generation pipeline that inserts real medical conversations into structured prompts for a Dutch fine-tuned large language model to supply linguistic and dialogue-flow references.
If this is right
- Synthetic dialogues can increase the volume of training data available for Dutch clinical NLP without violating privacy rules.
- Numerical metrics must be paired with human review because they miss shortfalls in naturalness and domain accuracy.
- Domain knowledge must be built into the prompting process to keep generated turns both structured and conversational.
- The pipeline supplies a reusable starting point for producing more Dutch clinical language resources.
Where Pith is reading between the lines
- The same reference-guided prompting approach could be tested on medical dialogues in other languages that also lack public datasets.
- Iterative refinement that feeds practitioner comments back into the prompt structure might raise the qualitative scores.
- Models trained on these dialogues could be checked for how well they handle real-time patient speech that deviates from scripted patterns.
Load-bearing premise
That the generated dialogues will still function as effective training material for clinical NLP models even though they show reduced natural expression and medical specificity.
What would settle it
Train a clinical NLP model on the synthetic dialogues alone, then measure its accuracy on a held-out collection of authentic Dutch medical conversations and compare the result against a model trained without any medical dialogue data.
Figures
read the original abstract
Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned LLM, with real conversations as linguistic and structural references. The generated dialogues are assessed via quantitative metrics (lexical variety, turn-taking regularity) and qualitative reviews by native speakers and medical practitioners, concluding that generation is feasible but requires domain knowledge and structured prompting to balance naturalness and structure.
Significance. If validated, the work offers a practical foundation for ethically generating Dutch clinical NLP resources where real data are scarce due to privacy constraints. The dual quantitative-qualitative evaluation approach is a positive element, though the absence of downstream task testing limits demonstrated impact on model training.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: No downstream experiments (e.g., training clinical NLP models for intent detection or entity recognition on the synthetic data versus real baselines) are reported, leaving the central utility claim untested despite documented issues with overly regular turn-taking and below-average domain specificity scores.
- [Results] Results/Qualitative review: The reported slightly below-average scores for natural expression and domain specificity, combined with limited correlation to quantitative metrics, directly challenge the assertion that the dialogues can reliably expand Dutch clinical NLP resources; this requires explicit mitigation or additional validation.
minor comments (1)
- [Methods] The description of the prompting structure and reference conversation selection could be expanded for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope and positioning of our work. We address each major comment below and have revised the manuscript to strengthen the discussion of limitations and future directions.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: No downstream experiments (e.g., training clinical NLP models for intent detection or entity recognition on the synthetic data versus real baselines) are reported, leaving the central utility claim untested despite documented issues with overly regular turn-taking and below-average domain specificity scores.
Authors: We agree that downstream task evaluation would provide additional evidence for the practical utility of the synthetic dialogues in clinical NLP applications. Our manuscript positions the contribution as a generation pipeline with direct quantitative and qualitative assessment of the output dialogues, rather than a full end-to-end demonstration of model training gains. The central claim is feasibility of generation under privacy constraints, with explicit acknowledgment of limitations such as regular turn-taking and domain specificity. To address the concern, we have added a dedicated paragraph in the Discussion section that outlines planned downstream experiments (e.g., fine-tuning intent detection and NER models) and the expected challenges arising from the observed linguistic issues. revision: partial
-
Referee: [Results] Results/Qualitative review: The reported slightly below-average scores for natural expression and domain specificity, combined with limited correlation to quantitative metrics, directly challenge the assertion that the dialogues can reliably expand Dutch clinical NLP resources; this requires explicit mitigation or additional validation.
Authors: The qualitative scores are reported transparently as slightly below average precisely to highlight the remaining gaps in naturalness and domain specificity. Our conclusion already states that generation is feasible but requires domain knowledge and structured prompting to balance naturalness and structure; the limited correlation between metric families is presented as an important methodological finding rather than a shortcoming to be hidden. We have expanded the Results and Discussion sections with additional analysis of the score discrepancies and concrete mitigation strategies (e.g., iterative prompting with medical terminology constraints and post-generation filtering), thereby strengthening the validation without overstating current reliability. revision: yes
Circularity Check
Empirical generation study with no derivations or self-referential predictions
full rationale
The paper describes a prompting pipeline that uses a Dutch fine-tuned LLM to produce synthetic medical dialogues, then evaluates output via standard lexical metrics plus human ratings from native speakers and practitioners. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. The feasibility claim is supported directly by the reported metrics and ratings rather than by any reduction to prior self-citations or definitional loops. This is a self-contained empirical contribution whose central result does not collapse into its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs fine-tuned on general Dutch text can be prompted to produce domain-specific medical dialogues that preserve linguistic and structural properties of real conversations
Reference graph
Works this paper leans on
-
[1]
Introduction Recent developments in Natural Language Pro- cessing (NLP) have greatly advanced text analy- sis, especially in the medical domain. Specifically, analyzing physician-patient conversations through clinical NLP can enrich research datasets and provide data-driven insights into patient-initiated concerns, which are often absent from Electronic H...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Synthetic medical data generation provides a valuable alternative under these constraints
Related Work One major challenge in clinical NLP is data scarcity and privacy concerns surrounding patient informa- tion. Synthetic medical data generation provides a valuable alternative under these constraints. This approach has been explored for synthetic medi- cal dialogues and clinical notes in several studies (Das et al., 2024; Mianroodi et al., 202...
-
[3]
Methodology In this section, we describe the experimental setup of synthetic Dutch medical dialogue generation in detail. Supporting materials are provided in the Appendices. 3.1. Data In this work, we use a real-life dataset containing transcriptionsofpatient-doctorconversationsinthe nephrology domain from Nivel Institute3’s archive collection as target ...
work page 2024
-
[4]
not always clear about the subject being nephrology
Results and Discussion 4.1. Quantitative Results We evaluated the generated dialogues using struc- tural, lexical, and topic-based metrics. Figure 3 summarizesthedistributionofwordandturncounts across all generated dialogues. D1 D2 D3 D4 D5 D6 D7 D8 D9 0 10 20 30 40 50 60 70Number of T urns 34 25 61 33 48 50 36 38 26 T urns, Mean=39.0 Words, Mean=867.2 0 ...
work page 2010
-
[5]
Conclusion and Future Work WeproposedapipelineleveragingaDutchdataset- fine-tuned LLM to generate synthetic medical dia- logues. In answer to our research question, find- ings indicate that while a Dutch LLM can feasibly produce synthetic medical dialogues that support clinical NLP pipeline development, the generated data do not yet match real-world dialo...
-
[6]
Ethical Considerations and Limitations Syntheticmedicaldialoguesofferanethicallyaware alternative in contexts where data scarcity and pri- vacy concerns restrict the development of clinical NLP. Such corpora can be shared in accordance with FAIR principles (Findable, Accessible, Inter- operable, Reusable)9, promoting data sharing and reproducibility witho...
-
[7]
Acknowledgements This research was supported by the MediSpeech project funded by ITEA4 under contract number 22032. We thank qualitative evaluators - Amir Chaman Baz, Lex Dingemans, Sandra van Dulmen, Edwin Geleijn, Henk van den Heuvel - for their comments on synthetic dialogues, which have led to many findings and further improvements
-
[8]
Bibliographical References Mariam ALMutairi, Lulwah AlKulaib, Melike Aktas, Sara Alsalamah, and Chang-Tien Lu. 2024. Syn- thetic arabic medical dialogues using advanced multi-agent llm techniques. InProceedings of The Second Arabic Natural Language Process- ing Conference, pages 11–26. Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina...
-
[9]
Aline E Gassenn, Luis GM Andrade, Douglas Teodoro, and Jose F Rodrigues-Jr
Expert evaluation of large language mod- els for clinical dialogue summarization.Scientific reports, 15(1):1195. Aline E Gassenn, Luis GM Andrade, Douglas Teodoro, and Jose F Rodrigues-Jr. 2025. Med- ical dialogue audio transcription: Dataset and benchmarking of asr models. InDataset Show- case Workshop (DSW), pages 71–82. SBC. Nicolas Hiebel, Olivier Fer...
work page 2025
-
[10]
In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580
A french medical conversations corpus an- notated for a virtual patient dialogue system. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580. Claudia Alessandra Libbi, Jan Trienes, Dolf Tri- eschnigg, and Christin Seifert. 2021. Gener- ating synthetic training data for supervised de- identification of electronic healt...
-
[11]
Generating synthetic documents with clin- ical keywords: A privacy-sensitive methodology. InProceedings of the First Workshop on Patient- Oriented Language Processing (CL4Health)@ LREC-COLING 2024, pages 115–123. Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, and Frank Rudzicz. 2025. Medsynth: Realistic, synthetic medical dia...
-
[12]
arXiv preprint arXiv:2402.12801
Few-shot clinical entity recognition in en- glish, french and spanish: masked language models outperform generative model prompting. arXiv preprint arXiv:2402.12801. Nictiz and SNOMED International. 2025. SNOMED CT Netherlands Edition. Nic- tiz (Netherlands Release Center). Nictiz, Netherlands Edition. PID https://nictiz.nl/wat-we- doen/activiteiten/termi...
-
[13]
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Opendutchwordnet. InProceedingsofthe 8th Global WordNet Conference (GWC), pages 302–310. Yvette Pyne, Yik Ming Wong, Haishuo Fang, and Edwin Simpson. 2023. Analysis of ‘one in a mil- lion’primary care consultation conversations us- ing natural language processing.BMJ Health & Care Informatics, 30(1):e100659. Okko Räsänen and Daniil Kocharov. 2025. A pipel...
-
[14]
Arts: Goed, laten we nu eens kijken naar wat u al doet en wat we kunnen verbeteren
Appendices 9.1. Prompt Used For Synthetic Dutch medical Dialogue Generation All prompts are written in Dutch. System Prompt, Dutch: Je bent een behulpzame medisch onderzoek- sassistent die Nederlandstalige medische di- alogen genereert. Gebruik alleen ’Patiënt:’ en ’Arts:’ als sprekerlabels. Gebruik alleen alge- meen aanvaarde medische feiten en vermijd h...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.