
arxiv: 2604.20022 · v2 · pith:EQMMM6FJnew · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.CL

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

Pith reviewed 2026-05-20 23:48 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CL
keywords: Bayesian inference · clinical decision support · large language models · modular AI · conversational systems · probabilistic reasoning · medical diagnosis · hybrid AI systems

The pith

Separating language parsing from Bayesian inference lets smaller LLMs outperform larger standalone models in clinical conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLMs currently used for medical advice mix next-token prediction with actual probabilistic decision making, which creates problems like missing uncertainty tracking and hard-to-audit reasoning. MoBayes addresses this by restricting the LLM to turning patient dialogue into structured observations while a separate Bayesian module updates probabilities, chooses questions to maximize information, and sets clear thresholds for answering or deferring. Tests across real and generated medical knowledge bases show the hybrid system beats frontier LLMs, including cases where cheap sensor models plus MoBayes beat bigger autonomous models at lower cost. The gains hold even when patients communicate in adversarial or varied styles. The core suggestion is that reliable conversational clinical support comes from splitting probabilistic reasoning away from language generation rather than relying on model scale alone.

Core claim

MoBayes is a modular framework in which an LLM functions only as a language interface that converts unstructured patient conversations into structured observations, while an independent Bayesian inference module maintains and updates posterior probabilities over diagnostic hypotheses, selects follow-up questions according to expected information gain, and applies calibrated thresholds to decide when to output a diagnosis, ask more, or abstain. This separation produces explicit, trackable posteriors and allows the statistical backend to be swapped for population-specific models without retraining the language component. Across both empirical and LLM-generated knowledge bases the resulting end

What carries the argument

The MoBayes modular split, in which the LLM serves solely as a parser of patient dialogue into structured observations while a Bayesian module performs all posterior updating, question selection, and decision-threshold control.
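
As a rough illustration of the loop described above, the sketch below runs one Bayesian update, scores candidate questions by expected information gain, and applies a posterior threshold. It is a minimal sketch under stated assumptions, not the paper's implementation: the diseases, features, probabilities, the function names, and the threshold `tau` are all hypothetical, and features are assumed binary and conditionally independent given the disease.

```python
import math

# Hypothetical toy knowledge base (not from the paper):
# prior P(d) and likelihoods P(feature = yes | d).
PRIOR = {"flu": 0.5, "covid": 0.3, "pneumonia": 0.2}
LIKELIHOOD = {
    "fever": {"flu": 0.9, "covid": 0.7, "pneumonia": 0.8},
    "loss_of_smell": {"flu": 0.05, "covid": 0.6, "pneumonia": 0.05},
    "chest_pain": {"flu": 0.1, "covid": 0.2, "pneumonia": 0.7},
}


def update(posterior, feature, answer):
    """Apply Bayes' rule for one binary observation (answer=True means 'yes')."""
    unnorm = {}
    for disease, p in posterior.items():
        p_yes = LIKELIHOOD[feature][disease]
        unnorm[disease] = p * (p_yes if answer else 1.0 - p_yes)
    z = sum(unnorm.values())
    return {disease: v / z for disease, v in unnorm.items()}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(posterior, feature):
    """Current entropy minus the expected entropy after asking about `feature`."""
    p_yes = sum(posterior[d] * LIKELIHOOD[feature][d] for d in posterior)
    expected_h = 0.0
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            expected_h += p_answer * entropy(update(posterior, feature, answer))
    return entropy(posterior) - expected_h


def next_action(posterior, asked, tau=0.9):
    """Diagnose if the top posterior reaches tau, else ask the best question, else defer."""
    top, p_top = max(posterior.items(), key=lambda kv: kv[1])
    if p_top >= tau:
        return ("diagnose", top)
    remaining = [f for f in LIKELIHOOD if f not in asked]
    if not remaining:
        return ("defer", None)
    best = max(remaining, key=lambda f: expected_information_gain(posterior, f))
    return ("ask", best)
```

In this sketch the language model would only supply the `(feature, answer)` pairs passed to `update`; every decision about what to ask and when to stop stays in the deterministic functions.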

If this is right

  • Explicit posterior tracking allows controllable abstention thresholds and auditable reasoning chains.
  • Population-specific statistical backends can be swapped without retraining the language model.
  • Cost advantages appear when pairing inexpensive LLMs with the Bayesian module rather than scaling the language model alone.
  • Performance advantages remain under adversarial patient communication styles.
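
The first bullet above can be made concrete with a small sketch of how an abstention threshold trades coverage against selective accuracy. The session data and the function name are hypothetical, chosen only to show the computation; the threshold values mirror the operating points named in the Figure 4 legend (τ→0, τ=0.50, τ=0.90).

```python
def selective_metrics(cases, tau):
    """Return (coverage, selective_accuracy) for a posterior threshold tau.

    `cases` is a list of (top_posterior, is_correct) pairs, one per session.
    A session is answered only when its top posterior reaches tau.
    """
    answered = [correct for p_top, correct in cases if p_top >= tau]
    coverage = len(answered) / len(cases)
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return coverage, accuracy


# Hypothetical sessions: (top posterior at stopping time, diagnosis correct?)
CASES = [
    (0.95, True), (0.92, True), (0.80, True),
    (0.70, False), (0.55, True), (0.40, False),
]

for tau in (0.0, 0.5, 0.9):
    coverage, accuracy = selective_metrics(CASES, tau)
    print(f"tau={tau:.2f} coverage={coverage:.2f} selective_accuracy={accuracy:.2f}")
```

Raising `tau` answers fewer sessions but, when the posteriors are well calibrated, the answered ones are more often correct.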

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of language interface from probabilistic core could be tested in non-medical domains that require both natural dialogue and calibrated uncertainty, such as legal intake or financial advice.
  • Replacing the current Bayesian backend with more expressive graphical models or causal graphs might further improve calibration without changing the LLM component.
  • If parsing errors prove the main failure mode, targeted fine-tuning of the LLM solely on observation extraction could be a focused improvement path.

Load-bearing premise

The LLM can reliably convert unstructured patient conversations into structured observations that are accurate enough for the Bayesian module to produce correct posterior estimates and decisions.

What would settle it

An experiment showing that LLM parsing errors systematically shift posteriors or decision thresholds enough to produce measurably worse clinical accuracy or safety than a matched standalone LLM.
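
One simple way to probe this premise is to flip a single parsed observation and measure how far the posterior moves. The sketch below does that with total variation distance. The knowledge base, observations, and function names are hypothetical, and it assumes binary, conditionally independent features; it illustrates the kind of sensitivity check the settling experiment would need, not a procedure reported in the paper.

```python
# Hypothetical toy knowledge base (not from the paper).
PRIOR = {"flu": 0.5, "covid": 0.3, "pneumonia": 0.2}
LIKELIHOOD = {
    "fever": {"flu": 0.9, "covid": 0.7, "pneumonia": 0.8},
    "loss_of_smell": {"flu": 0.05, "covid": 0.6, "pneumonia": 0.05},
    "chest_pain": {"flu": 0.1, "covid": 0.2, "pneumonia": 0.7},
}
# Hypothetical parser output: feature -> parsed yes/no answer.
OBSERVED = {"fever": True, "loss_of_smell": True, "chest_pain": False}


def posterior(prior, likelihood, observations):
    """Naive-Bayes posterior over diseases given binary observations."""
    post = dict(prior)
    for feature, answer in observations.items():
        for disease in post:
            p_yes = likelihood[feature][disease]
            post[disease] *= p_yes if answer else 1.0 - p_yes
    z = sum(post.values())
    return {disease: v / z for disease, v in post.items()}


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(p[d] - q[d]) for d in p)


def flip_sensitivity(prior, likelihood, observations):
    """Posterior shift caused by flipping each parsed answer, one at a time."""
    clean = posterior(prior, likelihood, observations)
    shifts = {}
    for feature in observations:
        noisy = dict(observations)
        noisy[feature] = not noisy[feature]  # simulate a single parsing error
        shifts[feature] = total_variation(clean, posterior(prior, likelihood, noisy))
    return clean, shifts


clean, shifts = flip_sensitivity(PRIOR, LIKELIHOOD, OBSERVED)
for feature, shift in sorted(shifts.items(), key=lambda kv: -kv[1]):
    print(f"flip {feature}: total variation shift = {shift:.3f}")
```

In this toy setting a single flipped answer on the most discriminative feature is enough to change the top-ranked diagnosis, which is why per-feature parsing error rates matter more than an aggregate accuracy figure.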

Figures

Figures reproduced from arXiv: 2604.20022 by Akhil Arora, Alexandra Kulinkina, David Sasu, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Lars Klein, Mary-Anne Hartley, Yena Chang, Yusuf Kesmen.

Figure 1. (a) Three paradigms for LLM-based diagnostic dialogue. Standalone: the LLM handles all reasoning, questioning, and diagnosis internally. LLM Bayesian: an external module computes EIG from LLM-derived posteriors, principled question selection, but no grounded knowledge base. BMBE (ours): the LLM serves only as a sensor; all diagnostic reasoning is performed by a deterministic Bayesian engine grounded in an …
Figure 2. Overview of the BMBE architecture. The LLM layer handles only language: parsing …
Figure 3. Left: DHS vs. API cost per token. Right: DHS vs. estimated cost per patient. In both views, BMBE sensors (circles) achieve higher DHS than standalone doctors (squares) at 10–18× lower cost.
[Plot residue from an adjacent figure: selective accuracy (%) vs. coverage (%); legend entries include GPT-5.4, Gemini 3.1 Pro, GPT-OSS-120B, Llama-4-Maverick, Qwen 3.6+, Kimi K2.5, and the operating points Triage (τ→0), Balanced (τ=0.50), Safety-critical (τ=0.90), BMBE + G…]
Figure 4. Operating point control. The green curve shows the accuracy-coverage frontier of BMBE + …
Figure 5. Left: DDXPlus prior distribution sorted by prevalence. The long tail (max / min ≈ 200×) reflects real-world disease frequency; the dashed line shows the uniform baseline. Right: Distribution of positive evidence count per evaluation patient across KBs. two variants: one using GPT-5.4 and one using Gemini 3.1, enabling a cross-model comparison of zero-shot medical knowledge. Construction proceeds in two sta…
Figure 6. Left: Distribution of LLM-elicited binary likelihoods P(yes | d) for both GPT and Gemini KBs; the strong left skew indicates that most disease–feature associations are weak. Right: CDF of per-pair KL divergence from uniform across all three KBs; DDXPlus (empirical) has the highest informativeness, while both LLM-KBs are comparable despite being synthetically generated. agreement on medical knowledge. GPT g…
Figure 7. Left: Scatter plot of P(yes | d) for 45 shared features across 18 diseases (n=810 pairs); dashed line is perfect agreement. Right: Distribution of pairwise likelihood differences; the left-skewed distribution (mean = −0.055) confirms Gemini’s systematically higher assignments.
[Plot residue from an adjacent figure: number of features vs. variance of P(yes ∣ d) across diseases; legend entries include GPT-5.4, Gemini 2.5 …]
Figure 8. Feature discriminativeness (left: cross-disease variance; right: cross-disease range) for …
Figure 9. DDXPlus selective accuracy vs. coverage ( …
Figure 10. DHS across patient personas. Shaded areas show degradation from the plain baseline …
Figure 11. Top-1 accuracy vs. KB size K. BMBE remains stable across a 4× increase in disease space; the standalone doctor is flat regardless of K
Figure 12 illustrates the engine’s belief dynamics. The left panel tracks the posterior of the ground-truth disease across turns for a representative case: competing hypotheses rise and fall as evidence accumulates. The right panel aggregates entropy trajectories across all sessions, separating correct and incorrect diagnoses: correct cases exhibit steady entropy collapse, while incorrect cases plateau at elevated …
Original abstract

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected-information-gain and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoBayes, a modular Bayesian dialogue framework for conversational clinical decision support. An LLM is used only as a language interface to parse unstructured patient conversations into structured observations (symptoms, history, etc.), while a separate Bayesian module performs posterior updating, selects follow-up questions via expected information gain, and applies calibrated thresholds for abstention or deferral. The central empirical claim is that this architecture outperforms standalone frontier LLMs (including matched model-family comparisons), remains robust under adversarial patient communication styles, and enables cost-effective pairings of small sensor models with MoBayes that exceed larger autonomous LLMs, across both empirical and LLM-generated knowledge bases. Code is released.

Significance. If the separation of parsing from probabilistic inference can be shown to be robust, the work would provide a concrete demonstration that explicit posterior tracking and controllable decision thresholds improve reliability and auditability over pure next-token prediction in clinical settings. The modular design also offers practical advantages in swapping population-specific statistical backends without retraining the language model. The public code release is a positive contribution to reproducibility.

major comments (2)
  1. [Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.
  2. [Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.
minor comments (2)
  1. [§3 (Methodology)] Notation for the mapping from parsed observations to likelihood functions should be made explicit, ideally with a small example or pseudocode.
  2. [Results figures] Figure captions and axis labels in the results section would benefit from greater clarity regarding which curves correspond to MoBayes versus baseline LLM configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify gaps in experimental reporting that limit the interpretability of our results. We address each major comment below and will incorporate the requested clarifications and additional analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.

    Authors: We agree that explicit quantification of parser performance is necessary to substantiate the separation of language parsing from probabilistic reasoning. Although the current experiments demonstrate that performance advantages persist across both clean and adversarially perturbed dialogues, we did not report direct parsing accuracy metrics or controlled ablations. In the revision we will add (i) parsing error rates computed against ground-truth structured observations on a held-out dialogue set and (ii) an ablation comparing Bayesian inference quality with noisy versus oracle-clean parsed inputs. These additions will allow readers to isolate the contribution of the modular Bayesian component. revision: yes

  2. Referee: [Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.

    Authors: We acknowledge that the experimental section lacked sufficient methodological detail. The baselines consist of standalone LLMs from the same model families used as sensor models in MoBayes; statistical comparisons were performed via paired t-tests over multiple random seeds, and adversarial styles were generated through targeted prompt modifications (e.g., vague, contradictory, or evasive patient responses). Data exclusion was limited to dialogues lacking any symptom or history information. In the revised §4 we will explicitly define all baselines, report p-values and confidence intervals, detail the prompt templates used for adversarial styles, and state the precise exclusion criteria. These clarifications will make the empirical support for the cost-effective sensor-model + MoBayes comparisons fully verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; modular separation uses standard Bayesian updating on parsed inputs

full rationale

The paper describes a design in which an LLM parses unstructured dialogue into structured observations that then serve as inputs to a conventional Bayesian update with expected information gain for question selection. No equations, fitted parameters, or performance metrics are shown to be defined in terms of the evaluation outcomes themselves. The claimed advantages are presented as empirical results across knowledge bases and adversarial scenarios rather than as quantities that reduce by construction to the same data or to self-citations. The central separation claim rests on the architectural distinction and comparative experiments, not on any self-definitional loop, imported uniqueness result, or ansatz smuggled via prior work by the same authors. This is the most common honest finding for a modular framework paper whose core contribution is an engineering separation rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLM parsing produces observations of sufficient quality for Bayesian inference and on the existence of calibrated decision thresholds whose values are not fully specified in the abstract.

free parameters (1)
  • abstention and decision thresholds
    Calibrated thresholds for stopping, deferring, or continuing the conversation are introduced to control selective decision-making; their specific values or fitting procedure are not detailed in the abstract.
axioms (1)
  • domain assumption: Structured observations extracted by the LLM are accurate and complete enough to serve as direct inputs to Bayesian posterior updating.
    The framework treats LLM-parsed observations as reliable evidence for probabilistic inference without discussing parsing error rates or robustness checks.
invented entities (1)
  • MoBayes modular framework (no independent evidence)
    purpose: To enforce separation between language interface and probabilistic reasoning module.
    The framework is newly introduced in this work; no independent prior evidence for its components in this exact configuration is referenced in the abstract.

pith-pipeline@v0.9.0 · 5790 in / 1547 out tokens · 72280 ms · 2026-05-20T23:48:56.767791+00:00 · methodology




    Inference Failure.The KB is adequate and the evidence pipeline introduced no detectable errors, yet the engine converged to the wrong diagnosis. Two subtypes: •Close: the ground truth remains in the top-3 posterior at session end, but the question budget or EIG policy did not resolve the differential. • Diverged: the ground truth is not in the top-3. The ...