Multi-agent Self-triage System with Medical Flowcharts
Pith reviewed 2026-05-17 21:50 UTC · model grok-4.3
The pith
Multi-agent system guides LLMs through 100 medical flowcharts to reach 95% top-3 retrieval accuracy and 99% navigation accuracy on simulated conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A multi-agent framework that assigns separate roles to retrieval, decision, and chat agents can steer LLMs to select the correct flowchart from 100 options and then follow its branches accurately across varied conversational inputs, as shown by 95.29 percent top-3 accuracy on 2,000 queries and 99.10 percent navigation success on 37,200 interactions generated from synthetic data.
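The retrieval figure above is a top-3 accuracy: a query counts as a hit when the correct flowchart appears anywhere among the three highest-ranked candidates. A minimal sketch of that metric follows; the flowchart names and rankings are hypothetical illustration data, not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 3) -> float:
    """Fraction of queries whose gold label is among the first k ranked candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        raise ValueError("at least one query is required")
    hits = sum(1 for candidates, label in zip(ranked, gold) if label in candidates[:k])
    return hits / len(gold)


# Hypothetical rankings for four queries (best candidate first).
ranked = [
    ["fever", "cough", "headache", "rash"],
    ["chest_pain", "heartburn", "cough"],
    ["rash", "itching", "hives", "fever"],
    ["headache", "dizziness", "nausea", "fever"],
]
# Hypothetical correct flowchart for each query.
gold = ["cough", "heartburn", "fever", "vomiting"]

print(top_k_accuracy(ranked, gold, k=3))
```

Widening k can only keep or raise the score, which is why a top-3 figure should not be read as the rate at which the first-ranked flowchart is correct.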
What carries the argument
A multi-agent framework with three roles: a retrieval agent picks the right flowchart, a decision agent interprets responses against flowchart rules, and a chat agent outputs patient-friendly guidance.
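The division of labor among the three agents can be sketched in a few dozen lines. This is a toy under stated assumptions, not the paper's implementation: keyword overlap stands in for LLM-based retrieval, a word list stands in for the LLM decision agent, and the two flowcharts, their questions, and all function names are invented for illustration (they are not AMA content).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    text: str                  # question (inner node) or advice (leaf)
    yes: Optional[str] = None  # next node id after a "yes"
    no: Optional[str] = None   # next node id after a "no"


# Toy flowcharts, invented for illustration only.
FLOWCHARTS = {
    "fever": {
        "keywords": {"fever", "temperature", "chills"},
        "nodes": {
            "start": Node("Is your temperature 100°F or higher?", "duration", "home"),
            "duration": Node("Has the fever lasted more than three days?", "clinic", "home"),
            "clinic": Node("Placeholder advice: contact a clinician."),
            "home": Node("Placeholder advice: monitor at home."),
        },
    },
    "cough": {
        "keywords": {"cough", "phlegm", "wheezing"},
        "nodes": {
            "start": Node("Are you coughing up blood?", "clinic", "home"),
            "clinic": Node("Placeholder advice: contact a clinician."),
            "home": Node("Placeholder advice: monitor at home."),
        },
    },
}


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieval_agent(query: str, k: int = 3) -> list:
    """Rank flowcharts by keyword overlap with the query (stand-in for an LLM)."""
    words = tokens(query)
    ranked = sorted(FLOWCHARTS, key=lambda n: (-len(words & FLOWCHARTS[n]["keywords"]), n))
    return ranked[:k]


def decision_agent(reply: str) -> Optional[str]:
    """Map free text to "yes", "no", or None when the reply is inconclusive."""
    words = tokens(reply)
    said_yes = bool(words & {"yes", "yeah", "yep"})
    said_no = bool(words & {"no", "nope", "nah"})
    if said_yes == said_no:  # neither signal, or both at once
        return None
    return "yes" if said_yes else "no"


def chat_agent(text: str) -> str:
    return f"Assistant: {text}"


def run_triage(query: str, replies: list):
    """Walk one flowchart and keep an audit trail of (node id, reply, decision)."""
    chart = retrieval_agent(query, k=1)[0]
    nodes = FLOWCHARTS[chart]["nodes"]
    node_id, trail = "start", []
    for reply in replies:
        node = nodes[node_id]
        if node.yes is None:  # reached a leaf
            break
        decision = decision_agent(reply)
        trail.append((node_id, reply, decision))
        if decision is not None:  # inconclusive replies repeat the same question
            node_id = node.yes if decision == "yes" else node.no
    return chart, trail, chat_agent(nodes[node_id].text)


chart, trail, message = run_triage("I have a fever and chills", ["Yeah", "I'm not sure", "Nope"])
print(chart, trail, message, sep="\n")
```

The audit trail is the point of the design: every move through the chart is tied to a specific question node and the reply that triggered it, and an inconclusive reply leaves the position unchanged instead of forcing a guess.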
If this is right
- The approach supplies an auditable record of each recommendation because every step traces back to a specific flowchart question.
- Combining free-text input with fixed clinical protocols can reduce the chance that an LLM invents medical advice.
- If scaled, the system could direct patients toward appropriate care levels and ease pressure on emergency and primary services.
Where Pith is reading between the lines
- Real-world deployment would require testing with actual patient phrasing and emotional states not present in the synthetic sets.
- The same retrieval-plus-navigation pattern could be adapted to other protocol-driven domains such as legal intake or financial eligibility screening.
- Integration with electronic health records might allow the system to pre-populate patient history into the flowchart questions.
Load-bearing premise
Performance measured on synthetic datasets of simulated conversations will generalize to real patients who use varied language, express emotions, omit details, or present cases outside the covered flowcharts.
What would settle it
A trial in which real patients interact with the system and the rate of correct flowchart selection or correct navigation falls substantially below the reported synthetic figures.
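One way to make "substantially below" concrete is to put a confidence interval around the accuracy observed in such a trial and check whether the synthetic figure lies above it. The sketch below uses the Wilson score interval for a binomial proportion; the trial counts (450 correct navigations out of 500) are hypothetical, and only the 99.10% figure comes from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


SYNTHETIC_NAVIGATION_ACCURACY = 0.9910  # reported on synthetic data

# Hypothetical real-patient trial: 450 correct navigations out of 500.
low, high = wilson_interval(successes=450, n=500)
falls_below = high < SYNTHETIC_NAVIGATION_ACCURACY

print(f"95% interval: ({low:.3f}, {high:.3f}); below synthetic figure: {falls_below}")
```

With these made-up counts the whole interval sits below 0.9910, which is the kind of outcome that would count against generalization; a trial whose interval contains the synthetic figure would not settle the question either way.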
Original abstract
Online health resources and large language models (LLMs) are increasingly used as a first point of contact for medical decision-making, yet their reliability in healthcare remains limited by low accuracy, lack of transparency, and susceptibility to unverified information. We introduce a proof-of-concept conversational self-triage system that guides LLMs with 100 clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support. The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations, respectively. Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation across varied conversational styles and conditions (N=37,200). By combining the flexibility of free-text interaction with the rigor of standardized clinical protocols, this approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, with potential to support informed patient decision-making while improving healthcare resource utilization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a proof-of-concept multi-agent conversational self-triage system that integrates LLMs with 100 AMA clinically validated flowcharts. It uses a retrieval agent to select the relevant flowchart from a patient query, a decision agent to interpret responses and navigate the flowchart, and a chat agent to produce patient-friendly outputs. The system is evaluated exclusively on large synthetic datasets of simulated conversations, reporting 95.29% top-3 accuracy for flowchart retrieval (N=2,000) and 99.10% accuracy for flowchart navigation across varied styles (N=37,200).
Significance. If the synthetic results generalize, the work offers a structured, auditable framework that combines LLM conversational flexibility with standardized clinical protocols, potentially reducing reliance on unverified online health information and improving initial triage decisions. The multi-agent design and use of validated flowcharts provide a concrete path toward transparent AI-assisted decision support in healthcare.
major comments (2)
- Evaluation section: All reported performance metrics (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) are obtained exclusively from synthetic conversation simulations. The manuscript provides no real-patient data, external validation, or ablation on inputs containing disfluencies, contradictions, emotional language, or information gaps absent from the flowchart templates, which directly undermines the central claim of a reliable and generalizable self-triage system for actual deployment.
- Abstract and Discussion: The claim that the approach demonstrates 'generalizable AI-assisted self-triage' is not supported by the evaluation design, as the synthetic generator is not shown to model realistic patient behavior distributions; this assumption is load-bearing for the stated goal of supporting informed patient decision-making.
minor comments (2)
- The description of the synthetic data generation process lacks sufficient detail on how 'varied conversational styles and conditions' were parameterized, making it difficult to assess coverage of edge cases.
- Figure captions and table headers could more explicitly distinguish between retrieval accuracy and navigation accuracy to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the evaluation is limited to synthetic data and that some claims in the original manuscript overstated generalizability. We have revised the abstract, added a dedicated Limitations section, and qualified language throughout to better align claims with the proof-of-concept scope and synthetic evaluation. We address each major comment below.
Point-by-point responses
- Referee: Evaluation section: All reported performance metrics (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) are obtained exclusively from synthetic conversation simulations. The manuscript provides no real-patient data, external validation, or ablation on inputs containing disfluencies, contradictions, emotional language, or information gaps absent from the flowchart templates, which directly undermines the central claim of a reliable and generalizable self-triage system for actual deployment.
  Authors: We agree that the reported metrics derive exclusively from synthetic simulations and that this constrains claims of reliability for real deployment. The synthetic generator was constructed to test robustness across varied styles and conditions at large scale (N=37,200), which is difficult to achieve with real data. In the revised manuscript we have added a Limitations subsection that explicitly notes the absence of real-patient data, disfluencies, contradictions, and emotional language, and we have incorporated additional synthetic ablations simulating these phenomena. We have also moderated the central claim from demonstrating a 'reliable and generalizable self-triage system' to showing technical feasibility on synthetic benchmarks. Real-patient data and external validation cannot be added in this revision, as they require prospective collection and IRB approval.
  revision: partial
- Referee: Abstract and Discussion: The claim that the approach demonstrates 'generalizable AI-assisted self-triage' is not supported by the evaluation design, as the synthetic generator is not shown to model realistic patient behavior distributions; this assumption is load-bearing for the stated goal of supporting informed patient decision-making.
  Authors: We accept this critique. The original wording implied broader generalizability than the synthetic evaluation can support, and the generator was not validated against real patient distributions. In the revised abstract and Discussion we have replaced phrases such as 'demonstrates ... generalizable AI-assisted self-triage' with 'achieves high accuracy on large-scale synthetic benchmarks' and 'provides a proof-of-concept for structured, auditable triage'. We now explicitly state that real-world generalizability remains to be confirmed through future studies with actual patients. These changes remove the load-bearing assumption from the current claims.
  revision: yes
- Inclusion of real-patient data, external clinical validation, or IRB-approved human-subject studies, which lie outside the scope and timeline of the current revision.
Circularity Check
No circularity: empirical metrics on independent synthetic test sets
The paper reports direct empirical accuracies (95.29% top-3 retrieval on N=2,000 and 99.10% navigation on N=37,200) computed from separately generated synthetic conversation datasets. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain. The multi-agent architecture is evaluated against externally constructed test cases rather than quantities defined inside the system itself, so the central claims remain independent of the reported results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 100 flowcharts from the American Medical Association are clinically validated and sufficient to cover the medical scenarios encountered in self-triage.
- domain assumption: Large language models can accurately interpret patient free-text responses and map them to flowchart decision points without introducing errors.
invented entities (1)
- Three-agent framework (retrieval agent, decision agent, chat agent): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage:
The system leverages a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent to identify the most relevant flowchart, interpret patient responses, and deliver personalized, patient-friendly recommendations... Performance was evaluated at scale using synthetic datasets of simulated conversations. The system achieved 95.29% top-3 accuracy in flowchart retrieval (N=2,000) and 99.10% accuracy in flowchart navigation (N=37,200).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]–[5]
Table 2: Five different conversational patterns defined for synthetic patient response generation. For each question node in the flowcharts, we ge...
Example question used for every pattern: “Are you more than three months pregnant?”
Brief
  Characteristics: 1. Brief and direct. 2. No elaboration or rephrasing the question. 3. Focus on “Yes” or “No”.
  Example answer: “Nope.”
Descriptive
  Definition (conclusive and descriptive): Responses that clearly answer the question and provide additional details, context, or elaboration to support the answer
  Characteristics: 1. Clear and definitive. 2. Expand beyond simply answering the question. 3. Always include details, context, or a personal anecdote.
  Example answer: “No, I just had a test recently, and it’s negative.”
Weak
  Definition (vague or partially conclusive): Responses that lean towards an answer but include uncertainty or hedge the statement with ...
  Characteristics: 1. Show doubt or ambiguity. 2. Provide a partial inclination toward “Yes” or “No” but doesn’t fully commit. 3. Indicators like “I guess”, “Maybe”, “Possibly”.
  Example answer: “I doubt it, but I guess it’s possible”
Uncertain
  Definition (inconclusive): Responses that remain uncertain due to a lack of sufficient information, neither confirmi...
  Characteristics: 1. Uncertain. 2. Indicators like “I don’t know” or “I’m not sure”.
  Example answer: “I’m not sure. I haven’t checked yet.”
Off-topic
  Definition (irrelevant): Responses that are completely unrelated to the question but still make basic conversational sense
  Characteristics: 1. Off-topic but plausible. 2. Introduces unrelated information that does not pertain to the question.
  Example answer: “Oh, I’ve been organizing my closet lately. It’s such a mess!”
-
[6]
Wang, X. & Cohen, R. A. Health Information Technology Use Among Adults: United States, July-December 2022. https://stacks.cdc.gov/view/cdc/133700 (2023). doi:10.15620/cdc:133700
-
[7]
Introducing ChatGPT. https://openai.com/index/chatgpt/ (2024)
-
[8]
Mendel, T., Singh, N., Mann, D. M., Wiesenfeld, B. & Nov, O. Laypeople’s Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study. J. Med. Internet Res. 27, e64290 (2025)
-
[9]
Giannouchos, T. V., Ukert, B. & Wright, B. Concordance in Medical Urgency Classification of Discharge Diagnoses and Reasons for Visit. JAMA Netw. Open 7, e2350522 (2024)
-
[10]
Symptom Checker. Mayo Clinic https://www.mayoclinic.org/symptom-checker/select-symptom/itt-20009075
-
[11]
Symptom Checker with Body from WebMD - Check Your Medical Symptoms. WebMD https://symptoms.webmd.com/
-
[12]
Symptom Checker : Check your symptoms | Isabel Healthcare. https://symptomchecker.isabelhealthcare.com
-
[13]
Wallace, W. et al. The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review. Npj Digit. Med. 5, 118 (2022)
-
[14]
Aboueid, S., Meyer, S., Wallace, J. R., Mahajan, S. & Chaurasia, A. Young Adults’ Perspectives on the Use of Symptom Checkers for Self-Triage and Self-Diagnosis: Qualitative Study. JMIR Public Health Surveill. 7, e22637 (2021)
-
[15]
Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2401.11817 (2025)
-
[16]
Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine | New England Journal of Medicine. https://www.nejm.org/doi/10.1056/NEJMsr2214184?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed
-
[17]
Faithfulness Hallucination Detection in Healthcare AI. (2024)
-
[18]
Zhao, H., Yang, F., Shen, B., Lakkaraju, H. & Du, M. Towards Uncovering How Large Language Model Works: An Explainability Perspective. Preprint at https://doi.org/10.48550/arXiv.2402.10688 (2024)
-
[19]
Towards Transparent AI: A Survey on Explainable Large Language Models. https://arxiv.org/html/2506.21812?utm_source=chatgpt.com
-
[20]
Collins, L. C., Gablasova, D. & Pill, J. ’Doing Questioning’ in the Emergency Department (ED). Health Commun. 38, 2721–2729 (2023)
-
[21]
Tai-Seale, M., Stults, C., Zhang, W. & Shumway, M. Expressing uncertainty in clinical interactions between physicians and older patients: what matters? Patient Educ. Couns. 86, 322–328 (2012)
-
[22]
Navigating the Uncertainties of Medicine | Harvard Medicine Magazine. https://magazine.hms.harvard.edu/articles/navigating-uncertainties-medicine
-
[23]
Bommasani, R. et al. The 2024 Foundation Model Transparency Index. Preprint at https://doi.org/10.48550/arXiv.2407.12929 (2025)
-
[24]
Luo, H. & Specia, L. From Understanding to Utilization: A Survey on Explainability for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2401.12874 (2024)
-
[25]
Casper, S. et al. Black-Box Access is Insufficient for Rigorous AI Audits. in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency 2254–2272 (Association for Computing Machinery, New York, NY, USA, 2024). doi:10.1145/3630106.3659037
-
[26]
Iversen, E. D. et al. Communication Skills Training: A Means to Promote Time-Efficient Patient-Centered Communication in Clinical Practice. J. Patient-Centered Res. Rev. 8, 307–314 (2021)
-
[27]
Mata, Á. N. de S. et al. Training in communication skills for self-efficacy of health professionals: a systematic review. Hum. Resour. Health 19, 30 (2021)
-
[28]
Swinglehurst, D. & Atkins, S. When ‘yes’ means ‘no’: why the small details of clinical interactions matter. Br. J. Gen. Pract. 68, 410–411 (2018)
-
[29]
Robinson, J. D. & Heritage, J. Physicians’ opening questions and patients’ satisfaction. Patient Educ. Couns. 60, 279–285 (2006)
-
[30]
Gao, L. et al. PAL: program-aided language models. in Proceedings of the 40th International Conference on Machine Learning vol. 202 10764–10799 (JMLR.org, Honolulu, Hawaii, USA, 2023)
-
[31]
Mishra, M. et al. Prompting with Pseudo-Code Instructions. Preprint at https://doi.org/10.48550/arXiv.2305.11790 (2023)
-
[32]
Skianis, K., Nikolentzos, G. & Vazirgiannis, M. Graph Reasoning with Large Language Models via Pseudo-code Prompting. Preprint at https://doi.org/10.48550/arXiv.2409.17906 (2024).
Supplementary Materials
Table 1: Failure analysis for flowchart retrieval evaluation task
Column headers as extracted: Model, Failure Type, Failure Definition, Occurrence
Ranking Failure Example PatternDemographics...
-
[33]
Brief
Model               Certain & Correct   Uncertain & Correct   Uncertain & Incorrect   Certain & Incorrect
OpenAI GPT4o        97.59%              1.53%                 0.03%                   0.85%
Claude Haiku        98.71%              0.44%                 0.32%                   0.53%
Gemini Lite         93.97%              0.51%                 0.46%                   5.06%
DeepSeek Chat       97.27%              0.96%                 0.66%                   1.12%
Average             96.88%              0.86%                 0.37%                   1.89%
Standard Deviation  2.04%               0.50%                 0.26%                   2.13%
-
[34]
Descriptive
Model               Certain & Correct   Uncertain & Correct   Uncertain & Incorrect   Certain & Incorrect
OpenAI GPT4o        96.97%              2.77%                 0.02%                   0.24%
Claude Haiku        98.20%              0.94%                 0.61%                   0.25%
Gemini Lite         95.63%              3.89%                 0.09%                   0.39%
DeepSeek Chat       96.60%              2.73%                 0.13%                   0.54%
Average             96.85%              2.58%                 0.21%                   0.35%
Standard Deviation  1.06%               1.22%                 0.27%                   0.14%
-
[35]
Weak
Model               Certain & Correct   Uncertain & Correct   Uncertain & Incorrect   Certain & Incorrect
OpenAI GPT4o        6.82%               70.77%                22.32%                  0.09%
Claude Haiku        33.04%              48.70%                18.09%                  0.17%
Gemini Lite         30.69%              47.05%                21.76%                  0.49%
DeepSeek Chat       23.41%              55.00%                21.09%                  0.51%
Average             23.49%              55.38%                20.81%                  0.31%
Standard Deviation  11.85%              10.82%                1.89%                   0.22%
-
[36]
Uncertain
Model               Certain & Unanswered   Uncertain & Unanswered   Uncertain & Answered   Certain & Answered
OpenAI GPT4o        0.26%                  93.55%                   6.15%                  0.04%
Claude Haiku        0.45%                  96.13%                   3.29%                  0.13%
Gemini Lite         0.47%                  83.61%                   15.33%                 0.58%
DeepSeek Chat       0.62%                  70.65%                   27.61%                 1.12%
Average             0.45%                  85.98%                   13.10%                 0.47%
Standard Deviation  0.15%                  11.56%                   10.96%                 0.49%
-
[38]
Is your temperature 100°F or higher?
Off-topic
Model               Off-topic   On-topic
OpenAI GPT4o        99.98%      0.02%
Claude Haiku        93.18%      6.82%
Gemini Lite         99.89%      0.11%
DeepSeek Chat       99.98%      0.02%
Average             98.26%      1.74%
Standard Deviation  3.39%       3.39%
Table 7: Prompts used for building the agents and generating synthetic dataset
Retrieval Agent
Your role: You are an assistant supporting an Emergency Department nurse in pat...
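The Average and Standard Deviation rows in the per-pattern tables above can be re-derived from the four per-model values. The sketch below does this for three columns, using only numbers copied from those tables; that the paper used the sample (n-1) standard deviation is an inference from the arithmetic, not something the source states.

```python
from statistics import mean, stdev

# Per-model values (percent) in the order GPT4o, Claude Haiku, Gemini Lite, DeepSeek Chat,
# paired with the (average, standard deviation) reported in the same table.
columns = {
    "Brief / Certain & Correct": ([97.59, 98.71, 93.97, 97.27], (96.88, 2.04)),
    "Weak / Certain & Correct": ([6.82, 33.04, 30.69, 23.41], (23.49, 11.85)),
    "Off-topic / Off-topic": ([99.98, 93.18, 99.89, 99.98], (98.26, 3.39)),
}

TOLERANCE = 0.011  # allows for two-decimal rounding in the published figures

results = {}
for name, (values, (reported_mean, reported_sd)) in columns.items():
    m, s = mean(values), stdev(values)  # stdev() is the sample (n-1) estimator
    consistent = abs(m - reported_mean) < TOLERANCE and abs(s - reported_sd) < TOLERANCE
    results[name] = (m, s, consistent)
    print(f"{name}: mean={m:.3f} sd={s:.3f} consistent={consistent}")
```

For these three columns the recomputed values agree with the reported ones to within rounding. With only four models, the standard deviation is dominated by a single outlier in each case (Gemini Lite for Brief, GPT4o for Weak, Claude Haiku for Off-topic), so the averages are best read alongside the individual rows.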