pith. machine review for the scientific record.

arxiv: 2604.07549 · v2 · submitted 2026-04-08 · 💻 cs.CL · cs.AI

Recognition: no theorem link

EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords synthetic dialogue generation · emergency medical services · conversational diagnosis prediction · multi-agent LLM · electronic patient care reports · multi-speaker dialogues · EMS workflow

The pith

A pipeline using multiple large language models generates synthetic multi-speaker emergency medical dialogues from patient reports and improves conversational diagnosis prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generation method that starts from real electronic patient care reports and uses coordinated language models to plan topics, create multi-person dialogues, and apply checks for accuracy and flow. This produces a dataset of over four thousand annotated EMS conversations with speaker roles, diagnoses, and turn topics. Human and model-based checks rate the outputs as realistic. When added to training data, the synthetic dialogues raise the accuracy, timeliness, and stability of models that predict diagnoses from ongoing conversations. The work addresses the shortage of suitable multi-party medical dialogue data for this task.

Core claim

The authors build an iterative multi-LLM pipeline that grounds dialogue creation in real ePCR data, enforces topic flow, and runs rule-based factual and consistency checks. The process produces EMSDialog, a collection of 4,414 synthetic multi-speaker EMS conversations annotated with 43 diagnoses, speaker roles, and turn-level topics. Both human evaluators and LLM judges rate the dialogues highly on realism and quality using utterance- and conversation-level measures. Training diagnosis-prediction models on data augmented with EMSDialog yields gains in accuracy, timeliness, and stability compared with training on real data alone.

What carries the argument

The ePCR-grounded, topic-flow-based multi-agent generation pipeline that plans, generates, and self-refines dialogues while applying rule-based factual and topic-flow checks.
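
A minimal sketch of the plan, generate, check, refine loop described above. All function names, the stubbed agents, and the four-topic flow below are illustrative assumptions for exposition, not the paper's implementation (in the actual pipeline the planner, generator, and refiner are LLM agents and the checks also cover factual grounding):

```python
# Illustrative sketch of an iterative plan -> generate -> check -> refine loop.
# Each agent is stubbed as a plain function; the topic order is a hypothetical
# four-step EMS flow standing in for the paper's full topic-flow specification.

EXPECTED_TOPIC_FLOW = ["Dispatch", "Introduction", "Chief Complaint", "Take Vital Signs"]

def plan(epcr):
    # Planner: turn an ePCR record into an ordered topic plan.
    return list(EXPECTED_TOPIC_FLOW)

def generate(plan_topics, epcr):
    # Generator: produce (topic, speaker, utterance) turns for each planned topic.
    return [(t, "EMT", f"<utterance about {t} grounded in {epcr['id']}>")
            for t in plan_topics]

def topic_flow_check(dialogue):
    # Rule-based check: topics must appear in the canonical EMS order.
    topics = [t for t, _, _ in dialogue]
    return topics == sorted(topics, key=EXPECTED_TOPIC_FLOW.index)

def refine(dialogue):
    # Refiner: reorder turns so the topic-flow check passes
    # (a deterministic stand-in for LLM self-refinement).
    return sorted(dialogue, key=lambda turn: EXPECTED_TOPIC_FLOW.index(turn[0]))

def generate_dialogue(epcr, max_iters=3):
    # Iterate generation and refinement until the rule-based check passes.
    dialogue = generate(plan(epcr), epcr)
    for _ in range(max_iters):
        if topic_flow_check(dialogue):
            return dialogue
        dialogue = refine(dialogue)
    raise RuntimeError("dialogue failed checks within the refinement budget")
```

The point of the sketch is the control flow: generation is grounded in a source record, checked against rules, and refined only when a check fails.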

If this is right

  • Models can track evolving clinical evidence across multiple speakers and decide when to commit to a diagnosis with greater reliability.
  • The annotated dataset supports finer-grained study of how information flows among EMS team members during calls.
  • Synthetic data becomes a practical supplement when real multi-party medical conversations are scarce or restricted.
  • The same grounded generation approach could scale to other medical or emergency-response dialogue settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the generated dialogues preserve the timing and uncertainty patterns of actual EMS calls, they could reduce the need for new real-world data collection in privacy-sensitive settings.
  • The pipeline might be adapted to create training examples for related tasks such as triage prioritization or handoff communication.
  • Wider use of such synthetic data could accelerate development of systems that support live decision-making in high-stakes conversations.

Load-bearing premise

The synthetic dialogues must be realistic and free of systematic artifacts so that training gains transfer to real EMS conversations instead of appearing only on synthetic test data.

What would settle it

Evaluate a diagnosis-prediction model trained with and without EMSDialog on a held-out collection of real, non-synthetic EMS conversations and check whether accuracy, timeliness, and stability improve with the synthetic data.
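
That proposed test amounts to an A/B comparison in which both models are scored only on held-out real conversations. The functions below are a toy sketch of the protocol; the majority-label classifier and the data shapes are illustrative assumptions, not the paper's models or metrics:

```python
from collections import Counter

def evaluate(model, real_test):
    # Accuracy on held-out REAL conversations (never synthetic ones).
    correct = sum(model(conv) == label for conv, label in real_test)
    return correct / len(real_test)

def compare_augmentation(train_fn, real_train, synthetic, real_test):
    # Train twice: once on real data alone, once on real + synthetic,
    # then evaluate both on the same real test set.
    baseline = evaluate(train_fn(real_train), real_test)
    augmented = evaluate(train_fn(real_train + synthetic), real_test)
    return {"baseline": baseline, "augmented": augmented,
            "gain": augmented - baseline}

def majority_train(data):
    # Toy stand-in for a diagnosis-prediction model: always predicts the
    # most frequent diagnosis label seen in its training data.
    label = Counter(lbl for _, lbl in data).most_common(1)[0][0]
    return lambda conv: label
```

A positive `gain` under this protocol, with the test set drawn from real EMS calls, is the evidence the premise above requires; the same gain measured on synthetic test data would not settle it.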

Figures

Figures reproduced from arXiv: 2604.07549 by Anthony Cortez, Homa Alemzadeh, Sahil Murtaza, Xueren Ge.

Figure 1: Qwen3-series model performance on conver…
Figure 2: a) ePCR; b) EMS Topic Flow; c) Synthetic Dialogue Generation Pipeline; d) Synthetic Dialogue Example.
Figure 3: Ablation results. (a-b) Downstream forecasting performance: last accuracy and edit overheads. (c-d) Conversation-level and utterance-level evaluation. Effectiveness of PLANNER. Adding PLANNER (Plan→Generate) leads to a clear improvement in logical structure (∆ = 38.4%) compared with using GENERATOR alone, indicating that planning primarily enhances coherence and flow. Consistently, models trained on resu…
Figure 4: Ablation Study: Conversational Diagnosis Prediction Performance
Figure 5: Ablation Study on Checker: Extrinsic Evaluation. Con: Concept Checker Only. TF: Topic Flow Checker
Figure 6: Ablation Study on Checker: Intrinsic Evalua…
Figure 7: Conversation-level LLM evaluation: prompt used to judge logical structure
Figure 8: Conversation-level LLM evaluation: prompt used to judge ranking
Figure 9: Utterance-level LLM evaluation: prompt used to judge realism
Figure 10: Utterance-level LLM evaluation: prompt used to judge safety
Figure 11: Utterance-level LLM evaluation: prompt used to judge role accuracy
Figure 12: Utterance-level LLM evaluation: prompt used to judge groundedness
Figure 13: Style Critic prompt used for providing style critiques
Figure 14: Planner prompt used for generating structured EMS dialogue plans
Figure 15: Generator prompt used for generating EMS dialogues
Figure 16: Refiner prompt used for refining the EMS dialogue styles
Figure 17: Rules authored by EMS experts
Figure 18: 0-Shot Prompting for Synthetic Dialogue Generation
Figure 19: 0-Shot + Rules Prompting for Synthetic Dialogue Generation
Figure 20: CoT Prompting for Synthetic Dialogue Generation
Figure 21: CoT + Rules Prompting for Synthetic Dialogue Generation
Original abstract

Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction. Our datasets and code are publicly available at https://uva-dsa.github.io/EMSDialog

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS dialogues generated from real ePCR records via a multi-LLM agent pipeline that performs topic-flow planning, dialogue generation, self-refinement, and rule-based factual/topic checks. The dialogues are annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations are used to confirm high quality and realism at utterance and conversation levels. The central empirical claim is that augmenting training data with EMSDialog improves accuracy, timeliness, and stability of conversational diagnosis prediction models.

Significance. If the reported gains hold when evaluated on independently collected real EMS conversations, the work would provide a practical, scalable method for creating annotated multi-party medical dialogue data in a domain where such resources are scarce. The public release of the dataset and code is a clear strength that supports reproducibility and further research on conversational diagnosis in emergency settings.

major comments (2)
  1. [Abstract, results paragraph] The claim that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction is load-bearing, yet the abstract does not state the provenance of the test split used for this task. If the test dialogues were also produced by the same multi-LLM pipeline (with identical topic-flow rules and self-refinement), measured gains could reflect distribution matching rather than improved modeling of evolving clinical evidence.
  2. [Evaluation] The pipeline relies on LLM self-refinement for generation and LLM-based scoring for quality assessment; this introduces a modest circularity risk that could inflate perceived realism. The manuscript should report the fraction of dialogues that required human override or external factual verification to demonstrate that quality is not solely LLM-internal.
minor comments (3)
  1. [Abstract] The abstract states that 43 diagnoses are annotated but neither lists them nor reports their frequency distribution; adding this information would help readers assess coverage of the prediction task.
  2. [Results] Specific quantitative results (e.g., exact accuracy deltas, timeliness metrics, stability measures, inter-annotator agreement for human evaluations) are referenced but not shown in the abstract; these should appear in the main results tables or figures with confidence intervals.
  3. [Pipeline] The manuscript should clarify the exact rule-based checks (factual consistency and topic-flow) and whether they are fully deterministic or still require LLM assistance, as this affects the degree of automation claimed.
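
To illustrate the distinction the last comment raises, a fully deterministic variant of the factual check could restrict itself to exact string matching against the source ePCR, with no LLM assistance. This is a hypothetical sketch, not the paper's checker, which also assigns the softer "semantic" and "inferable" support levels:

```python
def deterministic_concept_check(concepts, epcr_text):
    # Assign support for each extracted concept: 'exact' if the phrase
    # appears verbatim (case-insensitive) in the source ePCR, else 'none'.
    # No LLM is involved, so the result is fully reproducible.
    epcr_lower = epcr_text.lower()
    return {c: ("exact" if c.lower() in epcr_lower else "none")
            for c in concepts}

def grounded(support):
    # Hard rule: the utterance fails if any key concept is unsupported.
    return all(level != "none" for level in support.values())
```

A check of this shape is trivially automatable but misses paraphrases; clarifying where the published pipeline sits on that spectrum is exactly what the comment asks for.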

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's careful reading and constructive suggestions. Below we respond to each major comment and describe the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract, results paragraph] The claim that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction is load-bearing, yet the abstract does not state the provenance of the test split used for this task. If the test dialogues were also produced by the same multi-LLM pipeline (with identical topic-flow rules and self-refinement), measured gains could reflect distribution matching rather than improved modeling of evolving clinical evidence.

    Authors: Thank you for this observation. The abstract indeed omits the test set details for brevity. In the manuscript's Evaluation section, we describe that the diagnosis prediction models are evaluated on a held-out test split from EMSDialog (approximately 20% of the data), where the dialogues were generated with varied random seeds and topic sequences to increase diversity. We will update the results paragraph in the abstract to specify the test set provenance. While we recognize the referee's concern about distribution matching, the observed gains in timeliness and stability still demonstrate the value of the synthetic data for training robust models. revision: yes

  2. Referee: [Evaluation] The pipeline relies on LLM self-refinement for generation and LLM-based scoring for quality assessment; this introduces a modest circularity risk that could inflate perceived realism. The manuscript should report the fraction of dialogues that required human override or external factual verification to demonstrate that quality is not solely LLM-internal.

    Authors: We share the concern about potential circularity in LLM-based generation and evaluation. The pipeline already incorporates rule-based factual and topic-flow checks in addition to LLM self-refinement and human evaluations. We will revise the Evaluation section to report the fraction of dialogues that required human override or external factual verification, providing greater transparency on the extent of non-LLM quality controls. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical result independent of generation pipeline

full rationale

The paper presents a multi-LLM agent pipeline for generating synthetic EMS dialogues grounded in real ePCR data, using iterative planning, generation, self-refinement, and separate rule-based factual/topic-flow checks. Quality is assessed via independent human and LLM evaluations at utterance and conversation levels. The central claim—an observed improvement in accuracy, timeliness, and stability of conversational diagnosis prediction after EMSDialog-augmented training—is an empirical experimental outcome, not a derivation that reduces by construction to the generation rules, fitted parameters, or self-citations. No equations, self-definitional steps, or load-bearing self-citation chains appear; the test-set provenance concern is a generalization issue rather than circularity per the strict criteria requiring explicit reduction (e.g., Eq. X = Eq. Y or renamed fit). The derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the premise that LLMs guided by structured reports and topic plans can produce medically plausible multi-party dialogues; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption Large language models can collaboratively plan, generate, and self-refine multi-speaker dialogues that maintain factual consistency with source ePCR reports and follow realistic topic flows.
    This assumption underpins the iterative planning-generation-refinement pipeline and the claim that the resulting dialogues are suitable for training diagnosis models.

pith-pipeline@v0.9.0 · 5474 in / 1333 out tokens · 51337 ms · 2026-05-10T17:47:36.517273+00:00 · methodology

discussion (0)

