Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Fenglin Liu; Jiayuan Zhu; Jiazhen Pan; Junde Wu; Yuyuan Liu

arxiv: 2502.07143 · v3 · submitted 2025-02-11 · 💻 cs.CL

Pith reviewed 2026-05-23 03:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords medical dialogue · LLM healthcare assistant · Bayesian active learning · empathetic interaction · diagnostic uncertainty · grounded reasoning · medical guidelines · conversation benchmark

The pith

APP lets LLMs run multi-turn medical dialogues with empathy and Bayesian updates on guidelines, raising diagnostic accuracy over one-shot and multi-turn baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ask Patients with Patience (APP), a multi-turn LLM medical assistant that elicits symptoms through empathetic dialogue, applies Bayesian active learning to manage uncertainty transparently, and grounds its outputs in verified medical guidelines. It introduces a benchmark that simulates conversations using patient agents built from real consultation profiles, then shows APP outperforming state-of-the-art one-shot and multi-turn LLM baselines on diagnostic accuracy, uncertainty reduction, and user experience. A sympathetic reader would care because the approach directly targets the doctor shortage by making AI assistants more reliable and human-like in clinical settings.

Core claim

APP is a multi-turn LLM framework for medical assistance that improves communication by eliciting user symptoms through empathetic dialogue, incorporates Bayesian active learning to support transparent and adaptive diagnoses, and is built on verified medical guidelines to ensure clinically grounded reasoning. On a new benchmark of simulated medical conversations driven by profiles from real-world cases, APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience compared with SOTA one-shot and multi-turn LLM baselines.

What carries the argument

APP, which pairs empathetic multi-turn symptom elicitation with Bayesian active learning on verified medical guidelines to produce transparent, adaptive diagnoses in LLMs.
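The Bayesian active learning step can be pictured with a small, generic sketch. The text available here does not give the paper's equations, so everything below is an assumption made for illustration: the disease names, the prior, the symptom likelihoods, and the helper names `update`, `expected_information_gain`, and `next_question` are all hypothetical. The sketch keeps a posterior over candidate diagnoses, updates it with Bayes' rule after each yes/no answer, and picks the next question by expected entropy reduction.

```python
import math

# Hypothetical toy numbers, not taken from the paper or from any medical guideline.
PRIOR = {"flu": 0.5, "cold": 0.3, "strep": 0.2}
# P(symptom present | disease); symptom names are illustrative only.
LIKELIHOOD = {
    "fever": {"flu": 0.9, "cold": 0.2, "strep": 0.7},
    "sore_throat": {"flu": 0.4, "cold": 0.5, "strep": 0.95},
}


def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(posterior, symptom, present):
    """Bayes' rule: reweight each disease by the likelihood of the answer."""
    unnormalized = {}
    for disease, p in posterior.items():
        like = LIKELIHOOD[symptom][disease]
        unnormalized[disease] = p * (like if present else 1.0 - like)
    total = sum(unnormalized.values())
    return {d: v / total for d, v in unnormalized.items()}


def expected_information_gain(posterior, symptom):
    """Current entropy minus the expected entropy after asking about `symptom`."""
    p_yes = sum(posterior[d] * LIKELIHOOD[symptom][d] for d in posterior)
    expected_h = 0.0
    for present, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            expected_h += p_answer * entropy(update(posterior, symptom, present))
    return entropy(posterior) - expected_h


def next_question(posterior, asked):
    """Pick the not-yet-asked symptom with the largest expected information gain."""
    candidates = [s for s in LIKELIHOOD if s not in asked]
    return max(candidates, key=lambda s: expected_information_gain(posterior, s))
```

In a dialogue loop under these assumptions, one would call `next_question`, phrase the chosen symptom as a patient-friendly question, and pass the answer to `update`; the posterior itself is what makes the diagnosis inspectable at every turn.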

If this is right

LLMs can handle diagnostic uncertainty more transparently in multi-turn settings.
Medical guidelines can be used directly to ground LLM outputs for evidence-based reasoning.
Empathetic dialogue improves both information gathering and patient engagement in AI medical assistance.
Simulated patient profiles extracted from real cases provide a scalable way to benchmark medical dialogue systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of empathy and Bayesian updating could be tested in other high-stakes dialogue domains such as legal intake or financial advising.
Real-world deployment would likely require additional safeguards for privacy and liability not addressed in the simulated benchmark.
Integration with live electronic health records could supply the Bayesian updates with patient-specific priors beyond the guideline set.

Load-bearing premise

The benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases accurately represents real patient interactions and enables fair evaluation.

What would settle it

A head-to-head trial in which APP is run with real patients instead of simulated agents and shows no gain in accuracy or user ratings over the same baselines.

Figures

Figures reproduced from arXiv: 2502.07143 by Fenglin Liu, Jiayuan Zhu, Jiazhen Pan, Junde Wu, Yuyuan Liu.

**Figure 2.** APP Workflow. The system first maps dialogue …

**Figure 3.** An APP case study of human-centric multi-turn dialogue based on medical guidelines. The estimated …

**Figure 4.** Entropy Comparison across Iterations. Dr.APP consistently shows a sharper decrease in entropy, indicating increased diagnostic confidence and reduced uncertainty through iterative dialogues.

**Figure 5.** Confidence Analysis across Iterations. APP (DeepSeek-v3) shows increased confidence in the top predicted disease while reducing confidence in less likely conditions over multiple iterations, demonstrating improved diagnostic confidence with interpretability.
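The two quantities tracked in Figures 4 and 5, posterior entropy and the confidence gap between the top diagnosis and the runner-up, can be illustrated with a short sketch. The posteriors in `trajectory` are invented numbers chosen only to show the computation, and the function names are hypothetical; none of this reproduces the paper's data.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def top_gap(probs):
    """Difference between the highest and second-highest probability."""
    ranked = sorted(probs, reverse=True)
    return ranked[0] - ranked[1]


# Hypothetical posteriors over three candidate diagnoses at successive dialogue turns.
trajectory = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.3, 0.2],
    [0.8, 0.15, 0.05],
]

entropies = [shannon_entropy(p) for p in trajectory]
gaps = [top_gap(p) for p in trajectory]

for turn, (h, g) in enumerate(zip(entropies, gaps)):
    print(f"turn {turn}: entropy={h:.3f} bits, top-1 gap={g:.2f}")
```

A falling entropy curve and a widening top-1 gap are two views of the same effect: probability mass concentrating on one diagnosis as answers accumulate.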

Original abstract

The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

APP bundles empathetic multi-turn prompting with Bayesian active learning and guideline grounding, then measures gains on a new simulated-patient benchmark whose realism is unverified.

The core of this paper is a medical dialogue system called APP. It runs multi-turn conversations that start with empathetic symptom elicitation, then uses Bayesian active learning to update diagnostic probabilities while staying tied to verified guidelines. The authors also release a benchmark built from profiles taken out of real consultation records and turned into patient agents. That combination is what they test against one-shot and basic multi-turn LLM baselines, claiming higher accuracy, lower uncertainty, and better user experience scores.

Referee Report

1 major / 1 minor

Summary. The paper proposes Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant that elicits symptoms through empathetic dialogue, incorporates Bayesian active learning for transparent and adaptive diagnoses, and grounds reasoning in verified medical guidelines. It introduces a new benchmark of simulated medical conversations using patient agents driven by profiles extracted from real-world consultation cases, and reports that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience relative to SOTA one-shot and multi-turn LLM baselines.

Significance. If the benchmark is shown to faithfully reproduce the information structure, uncertainty, and interaction dynamics of real patient consultations, the framework could provide a useful template for building more reliable and patient-trusted LLM medical assistants, particularly by addressing common failures in uncertainty management and mechanical language.

major comments (1)

[Benchmark section] The central performance claims rest on a new simulated benchmark whose patient agents are 'driven by profiles extracted from real-world consultation cases.' The manuscript supplies no inter-annotator agreement on profile extraction, no expert ratings of conversation realism, and no ablation on agent-prompting variations to verify that partial disclosure, emotional consistency, and resistance to leading questions are reproduced. Without such evidence the measured gains cannot be confidently attributed to APP's grounded reasoning rather than benchmark artifacts.

minor comments (1)

[Abstract] The abstract states that APP 'improves diagnostic accuracy, reduces uncertainty, and enhances user experience' but omits any mention of the concrete metrics, statistical tests, or number of conversations used; a one-sentence summary of these quantities would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger validation of the simulated benchmark. We address this point directly below and outline planned revisions to strengthen the evidence that performance gains are attributable to APP rather than benchmark artifacts.

Referee: [Benchmark section] The central performance claims rest on a new simulated benchmark whose patient agents are 'driven by profiles extracted from real-world consultation cases.' The manuscript supplies no inter-annotator agreement on profile extraction, no expert ratings of conversation realism, and no ablation on agent-prompting variations to verify that partial disclosure, emotional consistency, and resistance to leading questions are reproduced. Without such evidence the measured gains cannot be confidently attributed to APP's grounded reasoning rather than benchmark artifacts.

Authors: We agree that the manuscript currently lacks formal inter-annotator agreement statistics for profile extraction and independent expert ratings of conversation realism. The profiles were derived from real consultation transcripts using a structured extraction protocol aligned with standard medical history templates, but we did not report agreement metrics or external validation. We will add these in the revision: (1) inter-annotator agreement (Cohen's kappa) on a held-out subset of profile extractions, and (2) ratings from two independent clinicians on realism dimensions including partial disclosure, emotional consistency, and resistance to leading questions. We will also include an ablation on agent-prompting variations (e.g., different levels of resistance and emotional expressiveness) to show that APP's relative gains persist. These additions will allow readers to assess whether the benchmark faithfully reproduces real interaction dynamics. We note that the reported improvements include uncertainty reduction via Bayesian active learning, a metric less susceptible to superficial prompting artifacts, but we accept that additional validation is required to fully support the claims.

revision: partial
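For readers unfamiliar with the agreement statistic named in the response, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a plain implementation of that standard formula; the function name and the example labels are hypothetical and do not come from the paper's annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: did each annotator mark a symptom as present in the profile?
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why it is preferred over raw percent agreement for skewed label distributions.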

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes the APP framework for medical dialogue and reports empirical improvements on a newly constructed benchmark whose patient agents are driven by profiles extracted from real-world cases. No equations, parameter fits, or predictions are described; no self-citations are invoked as load-bearing premises; and no step reduces by construction to its own inputs. The evaluation is therefore self-contained against an externally motivated benchmark rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.0 · 5760 in / 1108 out tokens · 33151 ms · 2026-05-23T03:33:58.006454+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
cs.AI · 2026-05 · unverdicted · novelty 5.0

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Nature, pages 1–7

Towards accurate differential diagnosis with large language models. Nature, pages 1–7. MSD Manual. 2025a. Msd manual consumer version. MSD Manual. 2025b. Msd manual professional edition. Robert Osazuwa Ness, Katie Matton, Hayden Helm, Sheng Zhang, Junaid Bajwa, Carey E Priebe, and Eric Horvitz. 2024. Medfuzz: Exploring the robustness of large language mod...

work page arXiv 2024
[2]

arXiv preprint arXiv:2311.16452

Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

work page arXiv
[3]

Advances in neural information processing systems, 35:27730–27744

Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744. Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. 2025. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) v...

work page arXiv 2025
[4]

Nature Medicine, pages 1–8

Toward expert-level medical question answering with large language models. Nature Medicine, pages 1–8. Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. 2025. Towards conversational diagnostic artificial intelligence. Nature, pages 1–9. Ehsan Ullah, Anil Parwani, ...

work page arXiv 2025
[5]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817. Guojun Yan, Jiahuan Pei, Pengjie Ren, Zhaochun Ren, Xin Xin, Huasheng Liang, Maarten de Rijke, and Zhumin Chen. 2022. Remedi: Resources for multi-domain, multi-service, medical dialogues. In Proceedings of the 45th International ACM SIGIR Con-...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19368–19376

Zhongjing: Enhancing the Chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19368–19376. Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qin...

work page arXiv 2023
[7]

2: Related, but unlikely to be useful

Diagnosis Accuracy • How accurate is the model’s predicted diagnosis compared to the actual diagnosis? • Rating Scale: 0: Completely unrelated to the actual diagnosis. 2: Related, but unlikely to be useful. 3: Closely related – may still be helpful. 4: Very close – minor difference but clinically similar. 5: Exact match with the ground truth diagnosis. ...

work page
[8]

2: Mostly incorrect - with major inaccuracies

Reliability Score (Rel.) • Does the model’s predicted disease align with verified medical knowledge? • Rating Scale: 1: Completely incorrect - contradicts medical guidelines. 2: Mostly incorrect - with major inaccuracies. 3: Partially correct - but has some errors. 4: Mostly accurate - only minor inconsistencies. 5: Fully accurate - aligns with esta...

work page
[9]

2: Poor – minimal effort to build trust

Fostering the Relationship (FR) • How would you rate the model’s behavior in fostering a relationship with the patient? • Rating Scale: 1: Very poor – no rapport or engagement. 2: Poor – minimal effort to build trust. 3: Fair – some acknowledgment but limited warmth. 4: Good – shows care and encourages connection. 5: Excellent – empathetic, resp...

work page
[10]

2: Poor – asks limited or irrelevant questions

Gathering Information (GI) • How would you rate the model’s ability to gather relevant information from the patient? • Rating Scale: 1: Very poor – fails to gather necessary details. 2: Poor – asks limited or irrelevant questions. 3: Fair – gathers some useful information. 4: Good – asks mostly appropriate and clear questions. 5: Excellent – thoroughly el...

work page
[11]

2: Poor – hard to follow or overly technical

Providing Information (PI) • How would you rate the model’s ability to provide understandable and accurate information to the patient? • Rating Scale: 1: Very poor – unclear or incorrect information. 2: Poor – hard to follow or overly technical. 3: Fair – mostly understandable but lacks clarity. 4: Good – clear with some complexity. 5: Excellent – clear, ...

work page
[12]

2: Mostly difficult - require effort to interpret

Accessibility Score (Acc.) • How easy is it for you to understand the question posed by the model? • Rating Scale: 1: Very difficult - full of medical jargon. 2: Mostly difficult - require effort to interpret. 3: Somewhat clear - but have some medical terms that may be confusing. 4: Mostly clear - only minor terminology issues. 5: Completely clear - n...

work page
[13]

2: Somewhat cold - little acknowledgment of concerns

Empathy Score (Emp.) • How empathetic does the model feel to you during the conversation? • Rating Scale: 1: Completely robotic - no sense of empathy. 2: Somewhat cold - little acknowledgment of concerns. 3: Neutral - acknowledges concerns but lacks warmth. 4: Shows care and reassurance - with some empathetic responses. 5: Very empathetic - makes you ...

work page
[14]

2: Partially answers - but lacks detail

Relevant Response Rate (RRR) • Does the model directly answer your follow-up questions before moving on? • Rating Scale: 1: Completely ignores the question or gives an irrelevant response. 2: Partially answers - but lacks detail. 3: Answers the question - but may miss key points. 4: Mostly relevant - only minor gaps. 5: Fully relevant - directly answers wi...

work page
[15]

2: Poor – limited openness or empathy

Fostering the Relationship (FR) • How would you rate the model’s behavior in fostering a relationship during the interaction? • Rating Scale: 1: Very poor – no rapport, closed-off. 2: Poor – limited openness or empathy. 3: Fair – acknowledges patient but lacks warmth. 4: Good – shows care and builds some trust. 5: Excellent – builds connection, respec...

work page
