Multi-agent Self-triage System with Medical Flowcharts

Yujia Liu; Sophia Yu; Hongyue Jin; Jessica Wen; Alexander Qian; Terrence Lee; Mattheus Ramsis; Gi Won Choi; Lianhui Qin; Xin Liu; Edward J. Wang

arxiv: 2511.12439 · v2 · submitted 2025-11-16 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-17 21:50 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords multi-agent systems · self-triage · medical flowcharts · conversational AI · large language models · healthcare decision support · synthetic evaluation

The pith

A multi-agent system guides LLMs through 100 medical flowcharts, reaching 95% top-3 retrieval accuracy and 99% navigation accuracy on simulated conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a conversational self-triage system that anchors large language models to 100 clinically validated flowcharts from the American Medical Association. A three-part multi-agent structure handles flowchart retrieval from patient descriptions, interpretation of answers against flowchart logic, and delivery of clear recommendations. Evaluation on two large synthetic datasets of simulated patient exchanges produced 95.29 percent top-3 retrieval accuracy and 99.10 percent navigation accuracy. The design aims to replace unverified free-text answers with traceable steps drawn from established protocols.

Core claim

A multi-agent framework that assigns separate roles to retrieval, decision, and chat agents can steer LLMs to select the correct flowchart from 100 options and then follow its branches accurately across varied conversational inputs, as shown by 95.29 percent top-3 accuracy on 2,000 queries and 99.10 percent navigation success on 37,200 interactions generated from synthetic data.

What carries the argument

The multi-agent framework with a retrieval agent to pick the right flowchart, a decision agent to interpret responses against flowchart rules, and a chat agent to output patient-friendly guidance.
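
The division of labor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the two toy flowcharts (including the invented follow-up questions and recommendations), the keyword-overlap retrieval, and the first-word yes/no matching are all hypothetical stand-ins for the LLM calls and the AMA flowcharts the paper describes.

```python
import re
from typing import Optional

# Hypothetical toy flowcharts (NOT the AMA content). Question nodes are dicts
# that map "yes"/"no" to the next node; plain strings are recommendations.
FLOWCHARTS = {
    "fever": {
        "keywords": {"fever", "temperature", "chills"},
        "start": "q1",
        "nodes": {
            "q1": {"question": "Is your temperature 100°F or higher?",
                   "yes": "q2", "no": "r_home"},
            "q2": {"question": "Has the fever lasted more than three days?",
                   "yes": "r_doctor", "no": "r_home"},
            "r_home": "Rest and drink fluids; monitor your symptoms.",
            "r_doctor": "Contact your doctor.",
        },
    },
    "cough": {
        "keywords": {"cough", "coughing", "phlegm"},
        "start": "q1",
        "nodes": {
            "q1": {"question": "Are you coughing up blood?",
                   "yes": "r_urgent", "no": "r_home"},
            "r_urgent": "Seek urgent care.",
            "r_home": "Rest and monitor your symptoms.",
        },
    },
}


def retrieval_agent(description: str, k: int = 3) -> list:
    """Rank flowcharts by keyword overlap (stand-in for LLM retrieval)."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    ranked = sorted(FLOWCHARTS,
                    key=lambda name: -len(words & FLOWCHARTS[name]["keywords"]))
    return ranked[:k]


def decision_agent(reply: str) -> Optional[str]:
    """Map a free-text reply to 'yes', 'no', or None when it is unclear."""
    tokens = re.findall(r"[a-z']+", reply.lower())
    first = tokens[0] if tokens else ""
    if first in {"yes", "yeah", "yep"}:
        return "yes"
    if first in {"no", "nope"}:
        return "no"
    return None


def chat_agent(recommendation: str) -> str:
    """Wrap the flowchart outcome in patient-facing wording."""
    return f"Based on your answers: {recommendation}"


def run(description: str, replies: list):
    """Pick a flowchart, walk it with the replies, and keep an audit trace."""
    chart = retrieval_agent(description, k=1)[0]
    nodes = FLOWCHARTS[chart]["nodes"]
    current = FLOWCHARTS[chart]["start"]
    trace = []  # (question, branch taken) pairs, one per resolved step
    for reply in replies:
        node = nodes[current]
        if isinstance(node, str):  # already at a recommendation
            break
        branch = decision_agent(reply)
        if branch is None:
            continue  # unclear reply: stay on the same question
        trace.append((node["question"], branch))
        current = node[branch]
    outcome = nodes[current]
    message = chat_agent(outcome) if isinstance(outcome, str) else outcome["question"]
    return chart, message, trace


if __name__ == "__main__":
    print(run("I have a fever and chills",
              ["Yeah, 101 this morning", "hmm, hard to say", "No, just since yesterday"]))
```

The trace returned by `run` is what makes the walk auditable in this sketch: every recorded step names the flowchart question that produced it, and an unclear reply never advances the flowchart.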

If this is right

The approach supplies an auditable record of each recommendation because every step traces back to a specific flowchart question.
Combining free-text input with fixed clinical protocols can reduce the chance that an LLM invents medical advice.
If scaled, the system could direct patients toward appropriate care levels and ease pressure on emergency and primary services.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world deployment would require testing with actual patient phrasing and emotional states not present in the synthetic sets.
The same retrieval-plus-navigation pattern could be adapted to other protocol-driven domains such as legal intake or financial eligibility screening.
Integration with electronic health records might allow the system to pre-populate patient history into the flowchart questions.

Load-bearing premise

Performance measured on synthetic datasets of simulated conversations will generalize to real patients who use varied language, express emotions, omit details, or present cases outside the covered flowcharts.

What would settle it

A trial in which real patients interact with the system and the rate of correct flowchart selection or correct navigation falls substantially below the reported synthetic figures.

Original abstract

Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision-making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof-of-concept conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free-text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, with potential to support informed patient decision-making while improving healthcare resource utilization.
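
For readers unfamiliar with the retrieval metric, top-3 accuracy counts a query as correct when the gold flowchart appears anywhere among the first three ranked candidates. The sketch below shows that computation; the function name, the flowchart labels, and the four example queries are hypothetical and are not drawn from the paper's data.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=3):
    """Fraction of queries whose gold label is among the first k candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have equal length")
    if not gold_labels:
        raise ValueError("at least one query is required")
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# Hypothetical ranked outputs for four queries (best candidate first).
ranked = [
    ["fever", "cough", "headache"],
    ["cough", "fever", "rash"],
    ["rash", "headache", "fever"],
    ["headache", "rash", "cough"],
]
gold = ["fever", "fever", "cough", "headache"]

if __name__ == "__main__":
    print(top_k_accuracy(ranked, gold, k=3))  # 0.75
    print(top_k_accuracy(ranked, gold, k=1))  # 0.5
```

The gap between the two printed values shows why a top-3 figure should not be read as the rate at which the first-ranked flowchart is correct.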

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

High synthetic accuracies for a three-agent flowchart triage system, but the real-patient generalization gap is the main limit on what the results show.

Here's the quick take on this arXiv paper: they get very high accuracy numbers for retrieving and navigating through 100 AMA medical flowcharts using a three-agent LLM system, but all of it is measured on synthetic patient conversations. The retrieval agent hits 95.29% top-3 accuracy on 2,000 cases, and the navigation piece reaches 99.10% across 37,200 varied simulated exchanges. That is the concrete result to note first.

What the work does well is put together a practical split of labor (retrieval to pick the flowchart, decision to step through the logic, and chat to handle the user side) while grounding everything in actual AMA-validated protocols rather than open-ended generation. The scale of the synthetic tests and the check across different conversational styles give the numbers some weight, and the setup avoids obvious circularity since the metrics come from separately generated data.

The soft spot that stands out is the evaluation itself. Everything rests on simulated conversations, so we still lack evidence on how the system handles real patients who ramble, contradict themselves, get emotional, or skip details the flowcharts expect. That assumption is central to the claim of a reliable self-triage tool, and the paper does not close it.

This is the kind of paper that will interest people working on constrained, guideline-driven AI applications in digital health or primary care. Readers who want a working example of multi-agent decomposition for medical protocols will find usable architecture details. It has enough structure, clear metrics, and a reproducible direction to deserve serious referee time, even though any review will almost certainly ask for real-world testing or stronger failure analysis. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents a proof-of-concept multi-agent conversational self-triage system that integrates LLMs with 100 AMA clinically validated flowcharts. It uses a retrieval agent to select the relevant flowchart from a patient query, a decision agent to interpret responses and navigate the flowchart, and a chat agent to produce patient-friendly outputs. The system is evaluated exclusively on large synthetic datasets of simulated conversations, reporting 95.29% top-3 accuracy for flowchart retrieval (N=2,000) and 99.10% accuracy for flowchart navigation across varied styles (N=37,200).

Significance. If the synthetic results generalize, the work offers a structured, auditable framework that combines LLM conversational flexibility with standardized clinical protocols, potentially reducing reliance on unverified online health information and improving initial triage decisions. The multi-agent design and use of validated flowcharts provide a concrete path toward transparent AI-assisted decision support in healthcare.

major comments (2)

Evaluation section: All reported performance metrics (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) are obtained exclusively from synthetic conversation simulations. The manuscript provides no real-patient data, external validation, or ablation on inputs containing disfluencies, contradictions, emotional language, or information gaps absent from the flowchart templates, which directly undermines the central claim of a reliable and generalizable self-triage system for actual deployment.
Abstract and Discussion: The claim that the approach demonstrates 'generalizable AI-assisted self-triage' is not supported by the evaluation design, as the synthetic generator is not shown to model realistic patient behavior distributions; this assumption is load-bearing for the stated goal of supporting informed patient decision-making.

minor comments (2)

The description of the synthetic data generation process lacks sufficient detail on how 'varied conversational styles and conditions' were parameterized, making it difficult to assess coverage of edge cases.
Figure captions and table headers could more explicitly distinguish between retrieval accuracy and navigation accuracy to improve readability.
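
One way the parameterization asked for in the first minor comment could be reported is as an explicit grid over question nodes, intended answers, and response styles. The sketch below is an assumption-laden illustration: the five style names come from the paper's table of conversational patterns, but the grid layout, the `expected` labels, and every function name are hypothetical, and the paper generates the actual patient replies with an LLM rather than from a fixed grid like this.

```python
from itertools import product

# Style names as listed in the paper's table of conversational patterns.
STYLES = ("Brief", "Descriptive", "Weak", "Uncertain", "Off-topic")

# Assumption: styles that commit to an answer should resolve to the intended
# branch, while the other two should leave the flowchart on the same node.
COMMITTED_STYLES = {"Brief", "Descriptive", "Weak"}


def expected_outcome(style: str, intended: str) -> str:
    """Label the behaviour the decision agent is expected to show."""
    return intended if style in COMMITTED_STYLES else "re-ask"


def build_cases(questions, styles=STYLES, answers=("yes", "no")):
    """Enumerate every (question, intended answer, style) combination."""
    return [
        {
            "question": question,
            "intended": intended,
            "style": style,
            "expected": expected_outcome(style, intended),
        }
        for question, intended, style in product(questions, answers, styles)
    ]


if __name__ == "__main__":
    cases = build_cases(["Are you more than three months pregnant?",
                         "Is your temperature 100°F or higher?"])
    print(len(cases))  # 2 questions x 2 answers x 5 styles = 20
```

Publishing a grid of this kind alongside the generator prompts would let a reader see at a glance which combinations were covered and which edge cases were never sampled.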

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the evaluation is limited to synthetic data and that some claims in the original manuscript overstated generalizability. We have revised the abstract, added a dedicated Limitations section, and qualified language throughout to better align claims with the proof-of-concept scope and synthetic evaluation. We address each major comment below.

Referee: Evaluation section: All reported performance metrics (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) are obtained exclusively from synthetic conversation simulations. The manuscript provides no real-patient data, external validation, or ablation on inputs containing disfluencies, contradictions, emotional language, or information gaps absent from the flowchart templates, which directly undermines the central claim of a reliable and generalizable self-triage system for actual deployment.

Authors: We agree that the reported metrics derive exclusively from synthetic simulations and that this constrains claims of reliability for real deployment. The synthetic generator was constructed to test robustness across varied styles and conditions at large scale (N=37,200), which is difficult to achieve with real data. In the revised manuscript we have added a Limitations subsection that explicitly notes the absence of real-patient data, disfluencies, contradictions, and emotional language, and we have incorporated additional synthetic ablations simulating these phenomena. We have also moderated the central claim from demonstrating a 'reliable and generalizable self-triage system' to showing technical feasibility on synthetic benchmarks. Real-patient data and external validation cannot be added in this revision, as they require prospective collection and IRB approval.
Revision: partial

Referee: Abstract and Discussion: The claim that the approach demonstrates 'generalizable AI-assisted self-triage' is not supported by the evaluation design, as the synthetic generator is not shown to model realistic patient behavior distributions; this assumption is load-bearing for the stated goal of supporting informed patient decision-making.

Authors: We accept this critique. The original wording implied broader generalizability than the synthetic evaluation can support, and the generator was not validated against real patient distributions. In the revised abstract and Discussion we have replaced phrases such as 'demonstrates ... generalizable AI-assisted self-triage' with 'achieves high accuracy on large-scale synthetic benchmarks' and 'provides a proof-of-concept for structured, auditable triage'. We now explicitly state that real-world generalizability remains to be confirmed through future studies with actual patients. These changes remove the load-bearing assumption from the current claims.
Revision: yes

standing simulated objections not resolved

Inclusion of real-patient data, external clinical validation, or IRB-approved human-subject studies, which lie outside the scope and timeline of the current revision.

Circularity Check

0 steps flagged

No circularity: empirical metrics on independent synthetic test sets

The paper reports direct empirical accuracies (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) computed from separately generated synthetic conversation datasets. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain. The multi-agent architecture is evaluated against externally constructed test cases rather than quantities defined inside the system itself, so the central claims remain independent of the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the clinical validity and coverage of the external AMA flowcharts plus the assumption that current LLMs can reliably interpret and follow flowchart logic in free-text dialogue without hallucinating steps.

axioms (2)

domain assumption: The 100 flowcharts from the American Medical Association are clinically validated and sufficient to cover the medical scenarios encountered in self-triage.
Invoked when the system selects and navigates these flowcharts as the authoritative source for recommendations.
domain assumption: Large language models can accurately interpret patient free-text responses and map them to flowchart decision points without introducing errors.
Required for the decision agent to maintain 99.10% navigation accuracy.
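
How strongly the second assumption holds depends on how each decision-agent output is scored. The sketch below tallies outcomes into the four buckets used as column headers in the paper's supplementary tables. It is a minimal sketch under stated assumptions: the record layout (predicted branch, certainty flag, gold branch), the function name, and the example records are hypothetical.

```python
from collections import Counter

# Bucket names follow the column headers of the supplementary tables.
BUCKETS = (
    "Certain & Correct",
    "Uncertain & Correct",
    "Uncertain & Incorrect",
    "Certain & Incorrect",
)


def tally(results):
    """Return the share of outputs per bucket.

    results: iterable of (predicted_branch, is_certain, gold_branch) tuples.
    """
    counts = Counter()
    for predicted, is_certain, gold in results:
        certainty = "Certain" if is_certain else "Uncertain"
        correctness = "Correct" if predicted == gold else "Incorrect"
        counts[f"{certainty} & {correctness}"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one result is required")
    return {bucket: counts[bucket] / total for bucket in BUCKETS}


if __name__ == "__main__":
    # Hypothetical records, not data from the paper.
    example = [
        ("yes", True, "yes"),
        ("no", True, "no"),
        ("yes", False, "yes"),
        ("no", True, "yes"),
    ]
    print(tally(example))
```

In a breakdown like this, "Certain & Incorrect" is the bucket that matters most for safety, because those are wrong branches taken without any signal that the agent should ask again.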

invented entities (1)

Three-agent framework (retrieval agent, decision agent, chat agent) · no independent evidence
purpose: To decompose flowchart selection, logic following, and patient communication into separate LLM roles.
Newly assembled in this work to structure the interaction with the flowcharts.

pith-pipeline@v0.9.0 · 5536 in / 1524 out tokens · 33598 ms · 2026-05-17T21:50:28.314488+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations... Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation (N=37,200).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

Table 2: Five different conversational patterns defined for synthetic patient response generation. For each question node in the flowcharts, we ge...

Brief
Characteristics: 1. Brief and direct. 2. No elaboration or rephrasing the question. 3. Focus on “Yes” or “No”.
Example: Q: “Are you more than three months pregnant?” A: “Nope.”

Descriptive
Definition: Conclusive and descriptive: Responses that clearly answer the question and provide additional details, context, or elaboration to support the answer.
Characteristics: 1. Clear and definitive. 2. Expand beyond simply answering the question. 3. Always include details, context, or a personal anecdote.
Example: Q: “Are you more than three months pregnant?” A: “No, I just had a test recently, and it’s negative.”

Weak
Definition: Vague or partially conclusive: Responses that lean towards an answer but include uncertainty or hedge the statement with ...
Characteristics: 1. Show doubt or ambiguity. 2. Provide a partial inclination toward “Yes” or “No” but doesn’t fully commit. 3. Indicators like “I guess”, “Maybe”, “Possibly”.
Example: Q: “Are you more than three months pregnant?” A: “I doubt it, but I guess it’s possible”

Uncertain
Definition: Inconclusive: Responses that remain uncertain due to a lack of sufficient information, neither confirmi...
Characteristics: 1. Uncertain. 2. Indicators like “I don’t know” or “I’m not sure”.
Example: Q: “Are you more than three months pregnant?” A: “I’m not sure. I haven’t checked yet.”

Off-topic
Definition: Irrelevant: Responses that are completely unrelated to the question but still make basic conversational sense.
Characteristics: 1. Off-topic but plausible. 2. Introduces unrelated information that does not pertain to the question.
Example: Q: “Are you more than three months pregnant?” A: “Oh, I’ve been organizing my closet lately. It’s such a mess!”

[6] Wang, X. & Cohen, R. A. Health Information Technology Use Among Adults: United States, July-December 2022. https://stacks.cdc.gov/view/cdc/133700 (2023). doi:10.15620/cdc:133700
[7] Introducing ChatGPT. https://openai.com/index/chatgpt/ (2024)
[8] Mendel, T., Singh, N., Mann, D. M., Wiesenfeld, B. & Nov, O. Laypeople’s Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study. J. Med. Internet Res. 27, e64290 (2025)
[9] Giannouchos, T. V., Ukert, B. & Wright, B. Concordance in Medical Urgency Classification of Discharge Diagnoses and Reasons for Visit. JAMA Netw. Open 7, e2350522 (2024)
[10] Symptom Checker. Mayo Clinic. https://www.mayoclinic.org/symptom-checker/select-symptom/itt-20009075
[11] Symptom Checker with Body from WebMD - Check Your Medical Symptoms. WebMD. https://symptoms.webmd.com/
[12] Symptom Checker: Check your symptoms | Isabel Healthcare. https://symptomchecker.isabelhealthcare.com
[13] Wallace, W. et al. The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review. Npj Digit. Med. 5, 118 (2022)
[14] Aboueid, S., Meyer, S., Wallace, J. R., Mahajan, S. & Chaurasia, A. Young Adults’ Perspectives on the Use of Symptom Checkers for Self-Triage and Self-Diagnosis: Qualitative Study. JMIR Public Health Surveill. 7, e22637 (2021)
[15] Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2401.11817 (2025)
[16] Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine 388(13), 1233–1239 (March 2023). https://www.nejm.org/doi/10.1056/NEJMsr2214184
[17] Faithfulness Hallucination Detection in Healthcare AI. (2024)
[18] Zhao, H., Yang, F., Shen, B., Lakkaraju, H. & Du, M. Towards Uncovering How Large Language Model Works: An Explainability Perspective. Preprint at https://doi.org/10.48550/arXiv.2402.10688 (2024)
[19] Towards Transparent AI: A Survey on Explainable Large Language Models. https://arxiv.org/html/2506.21812?utm_source=chatgpt.com
[20] Collins, L. C., Gablasova, D. & Pill, J. ‘Doing Questioning’ in the Emergency Department (ED). Health Commun. 38, 2721–2729 (2023)
[21] Tai-Seale, M., Stults, C., Zhang, W. & Shumway, M. Expressing uncertainty in clinical interactions between physicians and older patients: what matters? Patient Educ. Couns. 86, 322–328 (2012)
[22] Navigating the Uncertainties of Medicine | Harvard Medicine Magazine. https://magazine.hms.harvard.edu/articles/navigating-uncertainties-medicine
[23] Bommasani, R. et al. The 2024 Foundation Model Transparency Index. Preprint at https://doi.org/10.48550/arXiv.2407.12929 (2025)
[24] Luo, H. & Specia, L. From Understanding to Utilization: A Survey on Explainability for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2401.12874 (2024)
[25] Casper, S. et al. Black-Box Access is Insufficient for Rigorous AI Audits. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency 2254–2272 (Association for Computing Machinery, New York, NY, USA, 2024). doi:10.1145/3630106.3659037
[26] Iversen, E. D. et al. Communication Skills Training: A Means to Promote Time-Efficient Patient-Centered Communication in Clinical Practice. J. Patient-Centered Res. Rev. 8, 307–314 (2021)
[27] Mata, Á. N. de S. et al. Training in communication skills for self-efficacy of health professionals: a systematic review. Hum. Resour. Health 19, 30 (2021)
[28] Swinglehurst, D. & Atkins, S. When ‘yes’ means ‘no’: why the small details of clinical interactions matter. Br. J. Gen. Pract. 68, 410–411 (2018)
[29] Robinson, J. D. & Heritage, J. Physicians’ opening questions and patients’ satisfaction. Patient Educ. Couns. 60, 279–285 (2006)
[30] Gao, L. et al. PAL: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning vol. 202, 10764–10799 (JMLR.org, Honolulu, Hawaii, USA, 2023)
[31] Mishra, M. et al. Prompting with Pseudo-Code Instructions. Preprint at https://doi.org/10.48550/arXiv.2305.11790 (2023)
[32] Skianis, K., Nikolentzos, G. & Vazirgiannis, M. Graph Reasoning with Large Language Models via Pseudo-code Prompting. Preprint at https://doi.org/10.48550/arXiv.2409.17906 (2024)

Supplementary Materials

Table 1: Failure analysis for flowchart retrieval evaluation task
Model | Failure Type | Failure Definition | Occurrence
Ranking Failure Example Pattern Demographics...

Brief
Model | Certain & Correct | Uncertain & Correct | Uncertain & Incorrect | Certain & Incorrect
OpenAI GPT4o | 97.59% | 1.53% | 0.03% | 0.85%
Claude Haiku | 98.71% | 0.44% | 0.32% | 0.53%
Gemini Lite | 93.97% | 0.51% | 0.46% | 5.06%
DeepSeek Chat | 97.27% | 0.96% | 0.66% | 1.12%
Average | 96.88% | 0.86% | 0.37% | 1.89%
Standard Deviation | 2.04% | 0.50% | 0.26% | 2.13%

Descriptive
Model | Certain & Correct | Uncertain & Correct | Uncertain & Incorrect | Certain & Incorrect
OpenAI GPT4o | 96.97% | 2.77% | 0.02% | 0.24%
Claude Haiku | 98.20% | 0.94% | 0.61% | 0.25%
Gemini Lite | 95.63% | 3.89% | 0.09% | 0.39%
DeepSeek Chat | 96.60% | 2.73% | 0.13% | 0.54%
Average | 96.85% | 2.58% | 0.21% | 0.35%
Standard Deviation | 1.06% | 1.22% | 0.27% | 0.14%

Weak
Model | Certain & Correct | Uncertain & Correct | Uncertain & Incorrect | Certain & Incorrect
OpenAI GPT4o | 6.82% | 70.77% | 22.32% | 0.09%
Claude Haiku | 33.04% | 48.70% | 18.09% | 0.17%
Gemini Lite | 30.69% | 47.05% | 21.76% | 0.49%
DeepSeek Chat | 23.41% | 55.00% | 21.09% | 0.51%
Average | 23.49% | 55.38% | 20.81% | 0.31%
Standard Deviation | 11.85% | 10.82% | 1.89% | 0.22%

Uncertain
Model | Certain & Unanswered | Uncertain & Unanswered | Uncertain & Answered | Certain & Answered
OpenAI GPT4o | 0.26% | 93.55% | 6.15% | 0.04%
Claude Haiku | 0.45% | 96.13% | 3.29% | 0.13%
Gemini Lite | 0.47% | 83.61% | 15.33% | 0.58%
DeepSeek Chat | 0.62% | 70.65% | 27.61% | 1.12%
Average | 0.45% | 85.98% | 13.10% | 0.47%
Standard Deviation | 0.15% | 11.56% | 10.96% | 0.49%

Off-topic
Model | Off-topic | On-topic
OpenAI GPT4o | 99.98% | 0.02%
Claude Haiku | 93.18% | 6.82%
Gemini Lite | 99.89% | 0.11%
DeepSeek Chat | 99.98% | 0.02%
Average | 98.26% | 1.74%
Standard Deviation | 3.39% | 3.39%

Table 7: Prompts used for building the agents and generating synthetic dataset
Retrieval Agent
Your role: You are an assistant supporting an Emergency Department nurse in pat...
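
The Average and Standard Deviation rows in the tables above can be checked directly from the per-model values. The short sketch below does this for the "Certain & Correct" column of the Brief pattern. The variable names are illustrative; the only inputs are the four percentages as they appear in the extracted table.

```python
import statistics

# "Certain & Correct" percentages for the Brief pattern, as extracted above.
certain_correct_brief = {
    "OpenAI GPT4o": 97.59,
    "Claude Haiku": 98.71,
    "Gemini Lite": 93.97,
    "DeepSeek Chat": 97.27,
}

values = list(certain_correct_brief.values())
mean = statistics.mean(values)
sample_sd = statistics.stdev(values)       # divides by n - 1
population_sd = statistics.pstdev(values)  # divides by n

if __name__ == "__main__":
    print(f"mean={mean:.3f}")
    print(f"sample_sd={sample_sd:.2f}")
    print(f"population_sd={population_sd:.2f}")
```

The sample standard deviation rounds to 2.04, which matches the reported value, while the population formula gives about 1.77. This suggests the tables report the sample (n - 1) standard deviation across the four models. The mean comes out at about 96.885, consistent with the reported 96.88% up to rounding.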
