pith. sign in

arxiv: 2604.09645 · v1 · submitted 2026-03-25 · 💻 cs.CL · cs.AI

Generating High Quality Synthetic Data for Dutch Medical Conversations

Pith reviewed 2026-05-15 00:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords synthetic datamedical dialoguesDutch languageclinical NLPlarge language modelsdialogue generationprivacy-preserving dataqualitative evaluation
0
0 comments X

The pith

A pipeline using a fine-tuned Dutch LLM guided by real conversations can generate synthetic medical dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scarcity of Dutch medical conversation data limits clinical NLP development because privacy rules block access to real records. The paper builds a generation pipeline that feeds real dialogues as linguistic and structural references into prompts for a Dutch fine-tuned large language model. Quantitative checks find strong word variety yet overly regular speaker turns, while native speakers and medical reviewers give the outputs slightly below-average marks for natural phrasing and medical accuracy. The results show generation is possible when domain knowledge shapes the prompts to keep both structure and conversational feel. The work supplies an ethical route to grow training resources for Dutch clinical language tasks.

Core claim

The authors show that synthetic Dutch medical dialogues can be produced by prompting a Dutch fine-tuned LLM with real conversations as reference material for language and turn structure. Quantitative scores indicate high lexical diversity but excessively regular turn-taking, while qualitative ratings from native speakers and practitioners fall slightly below average due to shortfalls in natural expression and medical domain fit. The limited agreement between the two evaluation types leads to the conclusion that numerical metrics alone miss important quality aspects, and that careful prompting informed by domain expertise is required to reach usable balance between naturalness and structure.

What carries the argument

The generation pipeline that inserts real medical conversations into structured prompts for a Dutch fine-tuned large language model to supply linguistic and dialogue-flow references.

If this is right

  • Synthetic dialogues can increase the volume of training data available for Dutch clinical NLP without violating privacy rules.
  • Numerical metrics must be paired with human review because they miss shortfalls in naturalness and domain accuracy.
  • Domain knowledge must be built into the prompting process to keep generated turns both structured and conversational.
  • The pipeline supplies a reusable starting point for producing more Dutch clinical language resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reference-guided prompting approach could be tested on medical dialogues in other languages that also lack public datasets.
  • Iterative refinement that feeds practitioner comments back into the prompt structure might raise the qualitative scores.
  • Models trained on these dialogues could be checked for how well they handle real-time patient speech that deviates from scripted patterns.

Load-bearing premise

That the generated dialogues will still function as effective training material for clinical NLP models even though they show reduced natural expression and medical specificity.

What would settle it

Train a clinical NLP model on the synthetic dialogues alone, then measure its accuracy on a held-out collection of authentic Dutch medical conversations and compare the result against a model trained without any medical dialogue data.

Figures

Figures reproduced from arXiv: 2604.09645 by Aditya Kamlesh Parikh, Cecilia Kuan, Henk van den Heuvel.

Figure 1
Figure 1. Figure 1: Broader Project Workflow - this study fo [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Synthetic Dialogue Text Generation Work [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: shows a box plot of role consistency scores across the nine dialogues. The clustering of scores within the blue box suggests relatively similar word choice for both doctor and patient across dialogues, with one outlier indicating that one dialogue con￾tained comparatively more role-specific vocabulary. All scores fall below the heuristic baseline reference (gray shaded area). 0.00 0.01 0.02 0.03 0.04 0.05 … view at source ↗
Figure 5
Figure 5. Figure 5: Role Consistency - Roles D1 D2 D3 D4 D5 D6 D7 D8 D9 0 5 10 15 20 25 ASL 21.35 17.50 11.61 19.67 20.33 14.79 11.71 17.70 10.97 ASL, Mean=16.18 SPT, Mean=2.14 0.0 0.5 1.0 1.5 2.0 2.5 3.0 SPT 1.88 2.44 1.62 1.82 2.12 2.44 1.92 2.79 2.23 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Scores: TTR & MSTTR 4.2. Qualitative Results Human raters evaluated the dialogues across five qualitative categories [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Topic Coverage - Proportion per Dialogue [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Human Evaluation - Scores per Category across all categories (consistently below 0.12, with 60% of scores below zero), indicating substantial disagreement among raters [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Human Evaluation - per Rater in each Category [PITH_FULL_IMAGE:figures/full_fig_p007_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Human Evaluation - Inter Rater Reliabil [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Spearman Correlation (ρ) - Qualitative vs Quantitative Scores These quantitative-qualitative discrepancies echo rater comments, noting issues such as un￾clear domain focus ("not always clear about the subject being nephrology"), unnatural word choice resembling translated English, and inconsistencies in typical expressions ("errors in typical Flemish expressions"). Additional remarks include multi￾ple gre… view at source ↗
read the original abstract

Medical conversations offer insights into clinical communication often absent from Electronic Health Records. However, developing reliable clinical Natural Language Processing (NLP) models is hampered by the scarcity of domain-specific datasets, as clinical data are typically inaccessible due to privacy and ethical constraints. To address these challenges, we present a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned Large Language Model, with real medical conversations serving as linguistic and structural reference. The generated dialogues were evaluated through quantitative metrics and qualitative review by native speakers and medical practitioners. Quantitative analysis revealed strong lexical variety and overly regular turn-taking, suggesting scripted rather than natural conversation flow. Qualitative review produced slightly below-average scores, with raters noting issues in domain specificity and natural expression. The limited correlation between quantitative and qualitative results highlights that numerical metrics alone cannot fully capture linguistic quality. Our findings demonstrate that generating synthetic Dutch medical dialogues is feasible but requires domain knowledge and carefully structured prompting to balance naturalness and structure in conversation. This work provides a foundation for expanding Dutch clinical NLP resources through ethically generated synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a pipeline for generating synthetic Dutch medical dialogues using a Dutch fine-tuned LLM, with real conversations as linguistic and structural references. The generated dialogues are assessed via quantitative metrics (lexical variety, turn-taking regularity) and qualitative reviews by native speakers and medical practitioners, concluding that generation is feasible but requires domain knowledge and structured prompting to balance naturalness and structure.

Significance. If validated, the work offers a practical foundation for ethically generating Dutch clinical NLP resources where real data are scarce due to privacy constraints. The dual quantitative-qualitative evaluation approach is a positive element, though the absence of downstream task testing limits demonstrated impact on model training.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: No downstream experiments (e.g., training clinical NLP models for intent detection or entity recognition on the synthetic data versus real baselines) are reported, leaving the central utility claim untested despite documented issues with overly regular turn-taking and below-average domain specificity scores.
  2. [Results] Results/Qualitative review: The reported slightly below-average scores for natural expression and domain specificity, combined with limited correlation to quantitative metrics, directly challenge the assertion that the dialogues can reliably expand Dutch clinical NLP resources; this requires explicit mitigation or additional validation.
minor comments (1)
  1. [Methods] The description of the prompting structure and reference conversation selection could be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the scope and positioning of our work. We address each major comment below and have revised the manuscript to strengthen the discussion of limitations and future directions.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: No downstream experiments (e.g., training clinical NLP models for intent detection or entity recognition on the synthetic data versus real baselines) are reported, leaving the central utility claim untested despite documented issues with overly regular turn-taking and below-average domain specificity scores.

    Authors: We agree that downstream task evaluation would provide additional evidence for the practical utility of the synthetic dialogues in clinical NLP applications. Our manuscript positions the contribution as a generation pipeline with direct quantitative and qualitative assessment of the output dialogues, rather than a full end-to-end demonstration of model training gains. The central claim is feasibility of generation under privacy constraints, with explicit acknowledgment of limitations such as regular turn-taking and domain specificity. To address the concern, we have added a dedicated paragraph in the Discussion section that outlines planned downstream experiments (e.g., fine-tuning intent detection and NER models) and the expected challenges arising from the observed linguistic issues. revision: partial

  2. Referee: [Results] Results/Qualitative review: The reported slightly below-average scores for natural expression and domain specificity, combined with limited correlation to quantitative metrics, directly challenge the assertion that the dialogues can reliably expand Dutch clinical NLP resources; this requires explicit mitigation or additional validation.

    Authors: The qualitative scores are reported transparently as slightly below average precisely to highlight the remaining gaps in naturalness and domain specificity. Our conclusion already states that generation is feasible but requires domain knowledge and structured prompting to balance naturalness and structure; the limited correlation between metric families is presented as an important methodological finding rather than a shortcoming to be hidden. We have expanded the Results and Discussion sections with additional analysis of the score discrepancies and concrete mitigation strategies (e.g., iterative prompting with medical terminology constraints and post-generation filtering), thereby strengthening the validation without overstating current reliability. revision: yes

Circularity Check

0 steps flagged

Empirical generation study with no derivations or self-referential predictions

full rationale

The paper describes a prompting pipeline that uses a Dutch fine-tuned LLM to produce synthetic medical dialogues, then evaluates output via standard lexical metrics plus human ratings from native speakers and practitioners. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. The feasibility claim is supported directly by the reported metrics and ratings rather than by any reduction to prior self-citations or definitional loops. This is a self-contained empirical contribution whose central result does not collapse into its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that an LLM fine-tuned on Dutch can produce medically plausible dialogues when guided by real examples, and that current metrics plus human ratings are sufficient to judge usability.

axioms (1)
  • domain assumption LLMs fine-tuned on general Dutch text can be prompted to produce domain-specific medical dialogues that preserve linguistic and structural properties of real conversations
    Invoked in the pipeline description as the basis for generation.

pith-pipeline@v0.9.0 · 5480 in / 1190 out tokens · 35025 ms · 2026-05-15T00:33:26.610410+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Introduction Recent developments in Natural Language Pro- cessing (NLP) have greatly advanced text analy- sis, especially in the medical domain. Specifically, analyzing physician-patient conversations through clinical NLP can enrich research datasets and provide data-driven insights into patient-initiated concerns, which are often absent from Electronic H...

  2. [2]

    Synthetic medical data generation provides a valuable alternative under these constraints

    Related Work One major challenge in clinical NLP is data scarcity and privacy concerns surrounding patient informa- tion. Synthetic medical data generation provides a valuable alternative under these constraints. This approach has been explored for synthetic medi- cal dialogues and clinical notes in several studies (Das et al., 2024; Mianroodi et al., 202...

  3. [3]

    yes", "no

    Methodology In this section, we describe the experimental setup of synthetic Dutch medical dialogue generation in detail. Supporting materials are provided in the Appendices. 3.1. Data In this work, we use a real-life dataset containing transcriptionsofpatient-doctorconversationsinthe nephrology domain from Nivel Institute3’s archive collection as target ...

  4. [4]

    not always clear about the subject being nephrology

    Results and Discussion 4.1. Quantitative Results We evaluated the generated dialogues using struc- tural, lexical, and topic-based metrics. Figure 3 summarizesthedistributionofwordandturncounts across all generated dialogues. D1 D2 D3 D4 D5 D6 D7 D8 D9 0 10 20 30 40 50 60 70Number of T urns 34 25 61 33 48 50 36 38 26 T urns, Mean=39.0 Words, Mean=867.2 0 ...

  5. [5]

    Conclusion and Future Work WeproposedapipelineleveragingaDutchdataset- fine-tuned LLM to generate synthetic medical dia- logues. In answer to our research question, find- ings indicate that while a Dutch LLM can feasibly produce synthetic medical dialogues that support clinical NLP pipeline development, the generated data do not yet match real-world dialo...

  6. [6]

    Ethical Considerations and Limitations Syntheticmedicaldialoguesofferanethicallyaware alternative in contexts where data scarcity and pri- vacy concerns restrict the development of clinical NLP. Such corpora can be shared in accordance with FAIR principles (Findable, Accessible, Inter- operable, Reusable)9, promoting data sharing and reproducibility witho...

  7. [7]

    Acknowledgements This research was supported by the MediSpeech project funded by ITEA4 under contract number 22032. We thank qualitative evaluators - Amir Chaman Baz, Lex Dingemans, Sandra van Dulmen, Edwin Geleijn, Henk van den Heuvel - for their comments on synthetic dialogues, which have led to many findings and further improvements

  8. [8]

    Bibliographical References Mariam ALMutairi, Lulwah AlKulaib, Melike Aktas, Sara Alsalamah, and Chang-Tien Lu. 2024. Syn- thetic arabic medical dialogues using advanced multi-agent llm techniques. InProceedings of The Second Arabic Natural Language Process- ing Conference, pages 11–26. Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina...

  9. [9]

    Aline E Gassenn, Luis GM Andrade, Douglas Teodoro, and Jose F Rodrigues-Jr

    Expert evaluation of large language mod- els for clinical dialogue summarization.Scientific reports, 15(1):1195. Aline E Gassenn, Luis GM Andrade, Douglas Teodoro, and Jose F Rodrigues-Jr. 2025. Med- ical dialogue audio transcription: Dataset and benchmarking of asr models. InDataset Show- case Workshop (DSW), pages 71–82. SBC. Nicolas Hiebel, Olivier Fer...

  10. [10]

    In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580

    A french medical conversations corpus an- notated for a virtual patient dialogue system. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580. Claudia Alessandra Libbi, Jan Trienes, Dolf Tri- eschnigg, and Christin Seifert. 2021. Gener- ating synthetic training data for supervised de- identification of electronic healt...

  11. [11]

    InProceedings of the First Workshop on Patient- Oriented Language Processing (CL4Health)@ LREC-COLING 2024, pages 115–123

    Generating synthetic documents with clin- ical keywords: A privacy-sensitive methodology. InProceedings of the First Workshop on Patient- Oriented Language Processing (CL4Health)@ LREC-COLING 2024, pages 115–123. Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, and Frank Rudzicz. 2025. Medsynth: Realistic, synthetic medical dia...

  12. [12]

    arXiv preprint arXiv:2402.12801

    Few-shot clinical entity recognition in en- glish, french and spanish: masked language models outperform generative model prompting. arXiv preprint arXiv:2402.12801. Nictiz and SNOMED International. 2025. SNOMED CT Netherlands Edition. Nic- tiz (Netherlands Release Center). Nictiz, Netherlands Edition. PID https://nictiz.nl/wat-we- doen/activiteiten/termi...

  13. [13]

    Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

    Opendutchwordnet. InProceedingsofthe 8th Global WordNet Conference (GWC), pages 302–310. Yvette Pyne, Yik Ming Wong, Haishuo Fang, and Edwin Simpson. 2023. Analysis of ‘one in a mil- lion’primary care consultation conversations us- ing natural language processing.BMJ Health & Care Informatics, 30(1):e100659. Okko Räsänen and Daniil Kocharov. 2025. A pipel...

  14. [14]

    Arts: Goed, laten we nu eens kijken naar wat u al doet en wat we kunnen verbeteren

    Appendices 9.1. Prompt Used For Synthetic Dutch medical Dialogue Generation All prompts are written in Dutch. System Prompt, Dutch: Je bent een behulpzame medisch onderzoek- sassistent die Nederlandstalige medische di- alogen genereert. Gebruik alleen ’Patiënt:’ en ’Arts:’ als sprekerlabels. Gebruik alleen alge- meen aanvaarde medische feiten en vermijd h...