MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

Akhil Arora; Alexandra Kulinkina; David Sasu; Fay Elhassan; Jiayi Ma; Julien Stalhandske; Lars Klein; Mary-Anne Hartley; Yena Chang; Yusuf Kesmen

arxiv: 2604.20022 · v2 · pith:EQMMM6FJnew · submitted 2026-04-21 · cs.LG · cs.AI · cs.CL

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Pith reviewed 2026-05-20 23:48 UTC · model grok-4.3

classification cs.LG cs.AI cs.CL

keywords Bayesian inference, clinical decision support, large language models, modular AI, conversational systems, probabilistic reasoning, medical diagnosis, hybrid AI systems

The pith

Separating language parsing from Bayesian inference lets smaller LLMs outperform larger standalone models in clinical conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLMs currently used for medical advice mix next-token prediction with actual probabilistic decision making, which creates problems like missing uncertainty tracking and hard-to-audit reasoning. MoBayes addresses this by restricting the LLM to turning patient dialogue into structured observations while a separate Bayesian module updates probabilities, chooses questions to maximize information, and sets clear thresholds for answering or deferring. Tests across real and generated medical knowledge bases show the hybrid system beats frontier LLMs, including cases where cheap sensor models plus MoBayes beat bigger autonomous models at lower cost. The gains hold even when patients communicate in adversarial or varied styles. The core suggestion is that reliable conversational clinical support comes from splitting probabilistic reasoning away from language generation rather than relying on model scale alone.

Core claim

MoBayes is a modular framework in which an LLM functions only as a language interface that converts unstructured patient conversations into structured observations, while an independent Bayesian inference module maintains and updates posterior probabilities over diagnostic hypotheses, selects follow-up questions according to expected information gain, and applies calibrated thresholds to decide when to output a diagnosis, ask more, or abstain. This separation produces explicit, trackable posteriors and allows the statistical backend to be swapped for population-specific models without retraining the language component. Across both empirical and LLM-generated knowledge bases the resulting end

What carries the argument

The MoBayes modular split, in which the LLM serves solely as a parser of patient dialogue into structured observations while a Bayesian module performs all posterior updating, question selection, and decision-threshold control.
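The Bayesian side of this split can be illustrated with a minimal Python sketch, assuming a naive-Bayes knowledge base with binary features. The disease names, priors, and likelihood values are hypothetical placeholders rather than values from the paper, and the paper's engine may use a different likelihood model.

```python
from typing import Dict

# Hypothetical toy knowledge base (illustrative values, not from the paper).
PRIOR: Dict[str, float] = {"flu": 0.6, "pneumonia": 0.4}
# P(feature = yes | disease)
LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "fever": {"flu": 0.8, "pneumonia": 0.9},
    "chest_pain": {"flu": 0.1, "pneumonia": 0.6},
}


def update_posterior(belief: Dict[str, float], feature: str, present: bool) -> Dict[str, float]:
    """One Bayes step: multiply by the observation likelihood, then renormalise."""
    unnormalised = {}
    for disease, p in belief.items():
        p_yes = LIKELIHOOD[feature][disease]
        unnormalised[disease] = p * (p_yes if present else 1.0 - p_yes)
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {d: p / total for d, p in unnormalised.items()}


# The LLM "sensor" is assumed to have already parsed the dialogue into these observations.
belief = dict(PRIOR)
for feature, present in [("fever", True), ("chest_pain", True)]:
    belief = update_posterior(belief, feature, present)
print(belief)
```

Because the belief is an explicit dictionary of probabilities, every intermediate posterior can be logged and audited, which is the property the summary above attributes to the modular design.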

If this is right

Explicit posterior tracking allows controllable abstention thresholds and auditable reasoning chains.
Population-specific statistical backends can be swapped without retraining the language model.
Cost advantages appear when pairing inexpensive LLMs with the Bayesian module rather than scaling the language model alone.
Performance advantages remain under adversarial patient communication styles.
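A controllable abstention threshold can be sketched as follows. This is a minimal Python illustration, assuming the rule "commit when the top posterior reaches τ, otherwise defer"; the function names and the toy session data are hypothetical, and the paper's exact stopping rule is not given on this page.

```python
from typing import Dict, List, Optional, Tuple


def decide(posterior: Dict[str, float], tau: float) -> Optional[str]:
    """Return the top hypothesis if its posterior is at least tau, else None (abstain)."""
    top, p_top = max(posterior.items(), key=lambda kv: kv[1])
    return top if p_top >= tau else None


def coverage_and_selective_accuracy(
    sessions: List[Tuple[Dict[str, float], str]], tau: float
) -> Tuple[float, float]:
    """Coverage = fraction of sessions answered; selective accuracy = accuracy on those."""
    decisions = [(decide(post, tau), truth) for post, truth in sessions]
    answered = [(pred, truth) for pred, truth in decisions if pred is not None]
    coverage = len(answered) / len(sessions)
    if not answered:
        return coverage, float("nan")
    correct = sum(pred == truth for pred, truth in answered)
    return coverage, correct / len(answered)


# Hypothetical final posteriors paired with ground-truth labels.
sessions = [
    ({"flu": 0.95, "pneumonia": 0.05}, "flu"),
    ({"flu": 0.55, "pneumonia": 0.45}, "pneumonia"),
    ({"flu": 0.30, "pneumonia": 0.70}, "pneumonia"),
]
print(coverage_and_selective_accuracy(sessions, tau=0.50))
print(coverage_and_selective_accuracy(sessions, tau=0.90))
```

In this toy data, raising τ answers fewer sessions but removes the one wrong answer, which is the coverage versus selective-accuracy trade-off that an explicit posterior makes adjustable.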

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of language interface from probabilistic core could be tested in non-medical domains that require both natural dialogue and calibrated uncertainty, such as legal intake or financial advice.
Replacing the current Bayesian backend with more expressive graphical models or causal graphs might further improve calibration without changing the LLM component.
If parsing errors prove the main failure mode, targeted fine-tuning of the LLM solely on observation extraction could be a focused improvement path.

Load-bearing premise

The LLM can reliably convert unstructured patient conversations into structured observations that are accurate enough for the Bayesian module to produce correct posterior estimates and decisions.

What would settle it

An experiment showing that LLM parsing errors systematically shift posteriors or decision thresholds enough to produce measurably worse clinical accuracy or safety than a matched standalone LLM.

Figures

Figures reproduced from arXiv: 2604.20022 by Akhil Arora, Alexandra Kulinkina, David Sasu, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Lars Klein, Mary-Anne Hartley, Yena Chang, Yusuf Kesmen.

**Figure 1.** (a) Three paradigms for LLM-based diagnostic dialogue. Standalone: the LLM handles all reasoning, questioning, and diagnosis internally. LLM Bayesian: an external module computes EIG from LLM-derived posteriors, giving principled question selection but no grounded knowledge base. BMBE (ours): the LLM serves only as a sensor; all diagnostic reasoning is performed by a deterministic Bayesian engine grounded in an …

**Figure 2.** Overview of the BMBE architecture. The LLM layer handles only language: parsing …

**Figure 3.** Left: DHS vs. API cost per token. Right: DHS vs. estimated cost per patient. In both views, BMBE sensors (circles) achieve higher DHS than standalone doctors (squares) at 10–18× lower cost.

[Plot residue from a neighbouring figure: x-axis Coverage (%), y-axis Selective Accuracy (%); legend: default, GPT-5.4, Gemini 3.1 Pro, GPT-OSS-120B, Llama-4-Maverick, Qwen 3.6+, Kimi K2.5; operating points: Triage (τ→0), Balanced (τ=0.50), Safety-critical (τ=0.90), BMBE + G…]

**Figure 4.** Operating point control. The green curve shows the accuracy-coverage frontier of BMBE + …

**Figure 5.** Left: DDXPlus prior distribution sorted by prevalence. The long tail (max / min ≈ 200×) reflects real-world disease frequency; the dashed line shows the uniform baseline. Right: Distribution of positive evidence count per evaluation patient across KBs.

… two variants: one using GPT-5.4 and one using Gemini 3.1, enabling a cross-model comparison of zero-shot medical knowledge. Construction proceeds in two sta…

**Figure 6.** Left: Distribution of LLM-elicited binary likelihoods P(yes | d) for both GPT and Gemini KBs; the strong left skew indicates that most disease–feature associations are weak. Right: CDF of per-pair KL divergence from uniform across all three KBs; DDXPlus (empirical) has the highest informativeness, while both LLM-KBs are comparable despite being synthetically generated.

… agreement on medical knowledge. GPT g…

**Figure 7.** Left: Scatter plot of P(yes | d) for 45 shared features across 18 diseases (n=810 pairs); dashed line is perfect agreement. Right: Distribution of pairwise likelihood differences; the left-skewed distribution (mean = −0.055) confirms Gemini’s systematically higher assignments.

[Plot residue from a neighbouring figure: x-axis Variance of P(yes ∣ d) across diseases, y-axis Number of features; legend: GPT-5.4, Gemini 2.5 …]

**Figure 8.** Feature discriminativeness (left: cross-disease variance; right: cross-disease range) for …

**Figure 9.** DDXPlus selective accuracy vs. coverage ( …

**Figure 10.** DHS across patient personas. Shaded areas show degradation from the plain baseline …

**Figure 11.** Top-1 accuracy vs. KB size K. BMBE remains stable across a 4× increase in disease space; the standalone doctor is flat regardless of K.

**Figure 12.** Figure 12 illustrates the engine’s belief dynamics. The left panel tracks the posterior of the ground-truth disease across turns for a representative case: competing hypotheses rise and fall as evidence accumulates. The right panel aggregates entropy trajectories across all sessions, separating correct and incorrect diagnoses: correct cases exhibit steady entropy collapse, while incorrect cases plateau at elevated …

Original abstract

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next-token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected information gain, and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoBayes keeps the LLM to observation extraction only and lets a Bayesian backend handle posteriors, question selection, and abstention, with reported gains over pure LLMs even in matched comparisons.

The main point is that this paper splits the work so the LLM only turns patient conversation into structured observations while a Bayesian module does the actual probabilistic updating, picks questions by expected information gain, and sets abstention thresholds. That separation is the concrete contribution, and it allows swapping in different population-specific backends without retraining the language model. They show the combined system beating standalone frontier LLMs, including cheaper sensor models plus MoBayes outperforming larger autonomous ones, and the edge holds across diagnostic scenarios and adversarial patient styles.

Code release helps with checking the details. The results draw from both empirical and LLM-generated knowledge bases, which gives some breadth to the tests. The architecture makes posterior tracking and decision thresholds explicit, which addresses a real limitation in end-to-end LLM clinical tools.

The soft spot is the parsing step. The claims rest on the LLM producing clean, accurate observations that feed directly into the Bayesian updates. The abstract asserts robustness under adversarial communication, but without reported parsing error rates, inter-annotator checks, or ablations that isolate observation quality from inference quality, it is hard to rule out that gains come from unusually clean inputs rather than the modular design itself. If parsing noise is material, the posteriors and question selection would be systematically off.

This is aimed at people working on clinical decision support or hybrid neuro-symbolic systems who need controllable uncertainty handling. Readers looking for practical ways to add auditability to conversational medical AI would get value from the architecture and the comparisons.

The work is coherent on its own terms and has empirical backing plus code, so it deserves a serious referee rather than a desk reject. I would send it for review but flag the parsing validation and any sensitivity tests around observation errors as points for the referees to examine closely.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoBayes, a modular Bayesian dialogue framework for conversational clinical decision support. An LLM is used only as a language interface to parse unstructured patient conversations into structured observations (symptoms, history, etc.), while a separate Bayesian module performs posterior updating, selects follow-up questions via expected information gain, and applies calibrated thresholds for abstention or deferral. The central empirical claim is that this architecture outperforms standalone frontier LLMs (including matched model-family comparisons), remains robust under adversarial patient communication styles, and enables cost-effective pairings of small sensor models with MoBayes that exceed larger autonomous LLMs, across both empirical and LLM-generated knowledge bases. Code is released.

Significance. If the separation of parsing from probabilistic inference can be shown to be robust, the work would provide a concrete demonstration that explicit posterior tracking and controllable decision thresholds improve reliability and auditability over pure next-token prediction in clinical settings. The modular design also offers practical advantages in swapping population-specific statistical backends without retraining the language model. The public code release is a positive contribution to reproducibility.

major comments (2)

[Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.
[Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.

minor comments (2)

[§3 (Methodology)] Notation for the mapping from parsed observations to likelihood functions should be made explicit, ideally with a small example or pseudocode.
[Results figures] Figure captions and axis labels in the results section would benefit from greater clarity regarding which curves correspond to MoBayes versus baseline LLM configurations.
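The kind of small example requested in the first minor comment could look like the Python sketch below. It assumes each parsed observation is a triple (feature, value, confidence) and that the confidence c mixes the knowledge-base likelihood with an uninformative likelihood of 1. This mixing rule, the table, and all numbers are assumptions for illustration, not the paper's definition.

```python
from typing import Dict, Tuple

# Hypothetical P(feature = yes | disease) table.
P_YES: Dict[str, Dict[str, float]] = {
    "cough": {"flu": 0.7, "asthma": 0.5, "gerd": 0.2},
}

Observation = Tuple[str, str, float]  # (feature id, "yes" or "no", confidence in [0, 1])


def soft_likelihood(obs: Observation, disease: str) -> float:
    """Likelihood of one parsed observation under a disease, discounted by confidence.

    With c = 1 this is the plain knowledge-base likelihood; with c = 0 it is 1.0 for
    every disease, so the observation leaves the posterior unchanged.
    """
    feature, value, c = obs
    if not 0.0 <= c <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    p_yes = P_YES[feature][disease]
    hard = p_yes if value == "yes" else 1.0 - p_yes
    return c * hard + (1.0 - c) * 1.0


def apply_observation(belief: Dict[str, float], obs: Observation) -> Dict[str, float]:
    """Bayes update using the soft likelihood, followed by renormalisation."""
    weighted = {d: p * soft_likelihood(obs, d) for d, p in belief.items()}
    total = sum(weighted.values())
    return {d: w / total for d, w in weighted.items()}


uniform = {"flu": 1 / 3, "asthma": 1 / 3, "gerd": 1 / 3}
confident = apply_observation(uniform, ("cough", "yes", 1.0))
hesitant = apply_observation(uniform, ("cough", "yes", 0.5))
print(confident["flu"], hesitant["flu"])
```

Under this assumed rule, a hesitant answer moves the posterior less than a confident one, which is one way a parser's uncertainty could be carried into the inference step.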

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify gaps in experimental reporting that limit the interpretability of our results. We address each major comment below and will incorporate the requested clarifications and additional analyses in the revised manuscript.

Referee: [Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.

Authors: We agree that explicit quantification of parser performance is necessary to substantiate the separation of language parsing from probabilistic reasoning. Although the current experiments demonstrate that performance advantages persist across both clean and adversarially perturbed dialogues, we did not report direct parsing accuracy metrics or controlled ablations. In the revision we will add (i) parsing error rates computed against ground-truth structured observations on a held-out dialogue set and (ii) an ablation comparing Bayesian inference quality with noisy versus oracle-clean parsed inputs. These additions will allow readers to isolate the contribution of the modular Bayesian component. revision: yes
Referee: [Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.

Authors: We acknowledge that the experimental section lacked sufficient methodological detail. The baselines consist of standalone LLMs from the same model families used as sensor models in MoBayes; statistical comparisons were performed via paired t-tests over multiple random seeds, and adversarial styles were generated through targeted prompt modifications (e.g., vague, contradictory, or evasive patient responses). Data exclusion was limited to dialogues lacking any symptom or history information. In the revised §4 we will explicitly define all baselines, report p-values and confidence intervals, detail the prompt templates used for adversarial styles, and state the precise exclusion criteria. These clarifications will make the empirical support for the cost-effective sensor-model + MoBayes comparisons fully verifiable. revision: yes
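The noisy-versus-oracle ablation proposed in the first response can be sketched as a small simulation. The Python below is a minimal illustration under stated assumptions: a hypothetical two-disease knowledge base, and parser noise modelled as independent flips of binary observations with probability `flip_rate`. None of the numbers come from the paper.

```python
import random
from typing import Dict

# Hypothetical knowledge base: P(feature = yes | disease).
KB: Dict[str, Dict[str, float]] = {
    "d_a": {"f1": 0.9, "f2": 0.2, "f3": 0.6},
    "d_b": {"f1": 0.3, "f2": 0.8, "f3": 0.4},
}
PRIOR = {"d_a": 0.5, "d_b": 0.5}


def posterior(obs: Dict[str, bool]) -> Dict[str, float]:
    """Naive-Bayes posterior over diseases given binary observations."""
    scores = {}
    for disease, prior in PRIOR.items():
        score = prior
        for feature, present in obs.items():
            p = KB[disease][feature]
            score *= p if present else 1.0 - p
        scores[disease] = score
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def top1_accuracy(n_patients: int, flip_rate: float, seed: int = 0) -> float:
    """Top-1 accuracy when each true observation is flipped with probability flip_rate."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_patients):
        truth = rng.choice(list(PRIOR))
        true_obs = {f: rng.random() < p for f, p in KB[truth].items()}
        # Simulated parser: flips an observation with probability flip_rate.
        parsed = {f: (not v) if rng.random() < flip_rate else v for f, v in true_obs.items()}
        post = posterior(parsed)
        correct += max(post, key=post.get) == truth
    return correct / n_patients


oracle = top1_accuracy(2000, flip_rate=0.0)
noisy = top1_accuracy(2000, flip_rate=0.3)
print(f"oracle parser: {oracle:.3f}, noisy parser: {noisy:.3f}")
```

Sweeping `flip_rate` in a setup like this would show how quickly diagnostic accuracy degrades as parsing quality drops, separating the contribution of the inference module from that of the parser.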

Circularity Check

0 steps flagged

No significant circularity; modular separation uses standard Bayesian updating on parsed inputs

The paper describes a design in which an LLM parses unstructured dialogue into structured observations that then serve as inputs to a conventional Bayesian update with expected information gain for question selection. No equations, fitted parameters, or performance metrics are shown to be defined in terms of the evaluation outcomes themselves. The claimed advantages are presented as empirical results across knowledge bases and adversarial scenarios rather than as quantities that reduce by construction to the same data or to self-citations. The central separation claim rests on the architectural distinction and comparative experiments, not on any self-definitional loop, imported uniqueness result, or ansatz smuggled via prior work by the same authors. This is the most common honest finding for a modular framework paper whose core contribution is an engineering separation rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLM parsing produces observations of sufficient quality for Bayesian inference and on the existence of calibrated decision thresholds whose values are not fully specified in the abstract.

free parameters (1)

abstention and decision thresholds
Calibrated thresholds for stopping, deferring, or continuing the conversation are introduced to control selective decision-making; their specific values or fitting procedure are not detailed in the abstract.

axioms (1)

domain assumption: Structured observations extracted by the LLM are accurate and complete enough to serve as direct inputs to Bayesian posterior updating.
The framework treats LLM-parsed observations as reliable evidence for probabilistic inference without discussing parsing error rates or robustness checks.

invented entities (1)

MoBayes modular framework (no independent evidence)
purpose: To enforce separation between language interface and probabilistic reasoning module.
The framework is newly introduced in this work; no independent prior evidence for its components in this exact configuration is referenced in the abstract.

pith-pipeline@v0.9.0 · 5790 in / 1547 out tokens · 72280 ms · 2026-05-20T23:48:56.767791+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

BMBE decomposes this into a language interface, an LLM that parses patient utterances and verbalises questions, and a Bayesian reasoning engine that maintains beliefs, selects questions, and renders diagnostic decisions.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

The EIG is the reduction in expected entropy: $\mathrm{EIG}(f) = H(b_t) - \sum_{v} P(X_f = v \mid E_t)\, H\big(b_t^{f=v}\big)$
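
The quoted formula can be made concrete with a short Python sketch for binary features. It assumes a naive-Bayes knowledge base; the disease names and probabilities are hypothetical, and entropy is measured in bits.

```python
import math
from typing import Dict, Sequence

# Hypothetical belief b_t and P(feature = yes | disease) table.
PRIOR: Dict[str, float] = {"flu": 0.5, "pneumonia": 0.5}
P_YES: Dict[str, Dict[str, float]] = {
    "fever": {"flu": 0.8, "pneumonia": 0.8},       # same under both diseases
    "chest_pain": {"flu": 0.1, "pneumonia": 0.6},  # differs between diseases
}


def entropy(belief: Dict[str, float]) -> float:
    """Shannon entropy H(b) in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0.0)


def expected_information_gain(belief: Dict[str, float], feature: str) -> float:
    """EIG(f) = H(b_t) - sum_v P(X_f = v | E_t) * H(b_t^{f=v}) for v in {yes, no}."""
    eig = entropy(belief)
    for answer_is_yes in (True, False):
        joint = {}
        for disease, p in belief.items():
            p_yes = P_YES[feature][disease]
            joint[disease] = p * (p_yes if answer_is_yes else 1.0 - p_yes)
        p_answer = sum(joint.values())  # predictive probability of this answer
        if p_answer == 0.0:
            continue
        updated = {d: j / p_answer for d, j in joint.items()}
        eig -= p_answer * entropy(updated)
    return eig


def select_question(belief: Dict[str, float], candidates: Sequence[str]) -> str:
    """Pick the candidate feature with the largest expected information gain."""
    return max(candidates, key=lambda f: expected_information_gain(belief, f))


best = select_question(PRIOR, ["fever", "chest_pain"])
print(best)
```

In this toy table, asking about fever cannot change the belief because both diseases predict it equally, so the selector prefers the chest-pain question.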

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 3 internal anchors

[1]

Shortliffe.Computer-Based Medical Consultations: MYCIN

Edward H. Shortliffe.Computer-Based Medical Consultations: MYCIN. Elsevier, 1976

work page 1976
[2]

F. T. de Dombal, D. J. Leaper, J. R. Staniland, A. P. McCann, and Jane C. Horrocks. Computer- aided diagnosis of acute abdominal pain.British Medical Journal, 2(5804):9–13, 1972

work page 1972
[3]

Miller, Harry E

Randolph A. Miller, Harry E. Pople, and Jack D. Myers. Internist-I, an experimental computer- based diagnostic consultant for general internal medicine.New England Journal of Medicine, 307(8):468–476, 1982

work page 1982
[4]

Octo Barnett, James J

G. Octo Barnett, James J. Cimino, Jon A. Hupp, and Edward P. Hoffer. DXplain: An evolving diagnostic decision-support system.JAMA, 258(1):67–74, 1987

work page 1987
[5]

Heckerman, Eric J

David E. Heckerman, Eric J. Horvitz, and Bharat N. Nathwani. Toward normative expert systems: Part I. The Pathfinder project.Methods of Information in Medicine, 31(2):90–105, 1992

work page 1992
[6]

Greek Oracle

Randolph A. Miller and Fred E. Masarie. The demise of the “Greek Oracle” model for medical diagnostic systems.Methods of Information in Medicine, 29(01):1–2, 1990

work page 1990
[7]

Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Towards expert-level medical question answering with large language models.Nature Medicine, 2025

work page 2025
[8]

Capabilities of Gemini Models in Medicine

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, et al. Capabilities of Gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Towards conversational diagnostic AI

Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tober, et al. Towards conversational diagnostic AI. Nature, 2025

work page 2025
[10]

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Subhajit Choudhury, Sinead Williamson, Omar Rivasplata, and Tom Rainforth. BED-LLM: Intelligent information gathering with LLMs and Bayesian experimental design.arXiv preprint arXiv:2508.21184, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

DeLLMa: Decision making under uncertainty with large language models

Ollie Liu, Deqing Fu, Dani Levy, Maryam Fazel, Adith Swaminathan, and Willie Neiswanger. DeLLMa: Decision making under uncertainty with large language models. InProceedings of the International Conference on Learning Representations (ICLR), 2025. Spotlight. 12

work page 2025
[12]

BIRD: A trustworthy Bayesian inference framework for large language models.arXiv preprint arXiv:2404.12494, 2024

Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. BIRD: A trustworthy Bayesian inference framework for large language models.arXiv preprint arXiv:2404.12494, 2024

work page arXiv 2024
[13]

Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, and Junde Wu. Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2846–2857, 2025

work page 2025
[14]

Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov

Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. MediQ: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[15]

Fine-tuning large language models with medical data: Can safety be ensured?NEJM AI, 2(1), 2025

Minkyoung Kim, Yunha Kim, Hee Jun Kang, Hyeram Seo, Heejung Choi, JiYe Han, Gaeun Kee, Seohyun Park, Soyoung Ko, Hyoje Jung, Byeolhee Kim, Tae Joon Jun, and Young-Hak Kim. Fine-tuning large language models with medical data: Can safety be ensured?NEJM AI, 2(1), 2025. doi: 10.1056/AIcs2400390

work page doi:10.1056/aics2400390 2025
[16]

Yu, Lawrence M

Victor L. Yu, Lawrence M. Fagan, Sharon M. Wraith, William J. Clancey, A. Carlisle Scott, John Hannigan, Robert L. Blum, Bruce G. Buchanan, and Stanley N. Cohen. Antimicrobial selection by a computer: A blinded evaluation by infectious diseases experts.JAMA, 242(12): 1279–1282, 1979

work page 1979
[17]

Weiss, Casimir A

Sholom M. Weiss, Casimir A. Kulikowski, Saul Amarel, and Aran Safir. A model-based method for computer-aided medical decision-making.Artificial Intelligence, 11(1–2):145–172, 1978

work page 1978
[18]

Abdelzaher

Xinyi Liu, Dachun Sun, Yi Fung, Dilek Hakkani-Tür, and Tarek F. Abdelzaher. DocCHA: Towards LLM-augmented interactive online diagnosis system. InProceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDial), pages 609–619, 2025

work page 2025
[19]

Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment

Kaishuai Xu, Yi Cheng, Wenjun Hou, Qiaoyu Tan, and Wenjie Li. Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment. InFindings of the Association for Computational Linguistics: ACL 2024, pages 6796–6814, 2024

work page 2024
[20]

Thomas Savage, John Wang, Robert Gallo, Abdessalem Boukil, Vishwesh Patel, Seyed Amir Safavi-Naini, Ali Soroush, and Jonathan H. Chen. Large language model uncertainty proxies: Discrimination and calibration for medical diagnosis and treatment.Journal of the American Medical Informatics Association, 32(1):139–149, 2025

work page 2025
[21]

Collins, David Reich, Robert Freeman, and Eyal Klang

Mahmud Omar, Vera Sorin, Jeremy D. Collins, David Reich, Robert Freeman, and Eyal Klang. Multi-model assurance analysis: LLMs highly vulnerable to adversarial hallucination attacks during clinical decision support.Communications Medicine, 5(1):97, 2025

work page 2025
[22]

Omiye, Jenna C

Jesutofunmi A. Omiye, Jenna C. Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou. Large language models propagate race-based medicine.npj Digital Medicine, 6(1): 195, 2023

work page 2023
[23]

Afrimed-qa: A pan-african, multi-specialty, medical question-answering benchmark dataset

Tobi Olatunji, Charles Nimo, Abraham Owodunni, et al. AfriMed-QA: A pan-African, multi- specialty, medical question-answering benchmark dataset.arXiv preprint arXiv:2411.15640,

work page arXiv
[24]

ACL 2025, Best Social Impact Award

work page 2025
[25]

Conversational disease diagnosis via external planner-controlled large language models.arXiv preprint arXiv:2404.04292, 2024

Zhoujian Sun, Chenghua Luo, Liangzhi Jiang, Linlin Liu, Xiaohan Yang, Junfan Shi, Tangjie Lv, Benyou Zhang, and Kezhi Mao. Conversational disease diagnosis via external planner-controlled large language models.arXiv preprint arXiv:2404.04292, 2024

work page arXiv 2024
[26]

DDXPlus: A new dataset for automatic medical diagnosis

Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. DDXPlus: A new dataset for automatic medical diagnosis. InAdvances in Neural Information Processing Systems, volume 35, 2022

work page 2022
[27]

Jeffrey.The Logic of Decision

Richard C. Jeffrey.The Logic of Decision. McGraw-Hill, 1965. 13

work page 1965
[28]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

Mohammad Almansoori, Komal Kumar, and Hisham Cholakkal. MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions . Inproceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025, volume LNCS 15968. Springer Nature Switzerland, September 2025

work page 2025
[30]

AI hospital: Benchmarking large language models in a multi-agent medical interaction simulator

Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang. AI hospital: Benchmarking large language models in a multi-agent medical interaction simulator. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Compu-...

work page 2025
[31]

PatientSim: A persona-driven simulator for realistic doctor-patient interactions

Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, and Edward Choi. PatientSim: A persona-driven simulator for realistic doctor-patient interactions. InAdvances in Neural Information Processing Systems, volume 38, 2025

work page 2025
[32]

Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, and Yu Wang. Automatic interactive evaluation for large language models with state aware patient simulator. arXiv preprint arXiv:2403.08495, 2024. URL https://api.semanticscholar.org/CorpusID:268379575
[33]

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017.
[34]

Yifan Zhao, Yixiao Hua, Dan Roth, and Jinhao Chen. Probing the multi-turn planning capabilities of LLMs via 20 question games. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

Appendix overview

This supplementary material is organized as follows: A. Theoretical foundations ...
[35]

Classify the user response into one of the Allowed Values

[36]

Assess confidence: very_likely, likely, uncertain, unlikely, very_unlikely

KEY RULES:
– Direct answer ("Yes", "Not really"): map to closest value.
– Unrelated response: return "unknown|likely".
– Uncertain language ("I think so", "maybe"): use "uncertain".
– Prefer "unknown" over hard negative when partial/vague.

Return format: "value|confidence_level"

Ex...
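The "value|confidence_level" return format in the parser prompt above lends itself to a small validation step before the result reaches the inference engine. The following is a minimal sketch in Python, not the authors' implementation: the names `parse_llm_answer`, `ParsedAnswer`, and `allowed_values` are hypothetical, and the fallback to "unknown" with "uncertain" confidence on malformed output is an assumption of this sketch rather than behaviour stated in the paper.

```python
from typing import NamedTuple

# Confidence labels as listed in the parser prompt.
CONFIDENCE_LEVELS = (
    "very_likely", "likely", "uncertain", "unlikely", "very_unlikely",
)


class ParsedAnswer(NamedTuple):
    value: str
    confidence: str


def parse_llm_answer(raw: str, allowed_values: set[str]) -> ParsedAnswer:
    """Split a 'value|confidence_level' string and validate both halves.

    Assumption: malformed output degrades to ('unknown', 'uncertain')
    instead of raising, so that a bad parse never injects hard evidence.
    """
    fallback = ParsedAnswer("unknown", "uncertain")
    # Tolerate surrounding whitespace and quotes around the whole string.
    value, sep, confidence = raw.strip().strip('"').partition("|")
    value, confidence = value.strip(), confidence.strip()
    if not sep or confidence not in CONFIDENCE_LEVELS:
        return fallback
    if value != "unknown" and value not in allowed_values:
        return fallback
    return ParsedAnswer(value, confidence)
```

Degrading to "unknown" mirrors the prompt's own preference for "unknown" over a hard negative when the response is partial or vague.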
[37]

NEVER use technical IDs (e.g., "f_fever", "d_flu")
[38]

Speak naturally and empathetically

[39]

Do NOT mention probabilities or internal values

[40]

If clarifying a previous confusion, keep it brief.

Bulk intake prompt. At session start, a single bulk intake call maps the patient’s opening narrative to multiple (f, v, c) triples simultaneously, reducing the number of follow-up questions needed.

Bulk Intake Prompt

System: You are an expert medical intake specialist.
User Text: "{narrative}"
TASK: Extract ...
[41]

Only extract explicitly mentioned or strongly implied features.
[42]

Extract demographics (age, gender, location) if present

[43]

Do NOT infer negatives from silence; omit unlisted features

[44]

Assess confidence for each extracted feature.

Return JSON: {"feature_id": {"value": "...", "confidence": "likely"}, "demographics": {"age": N, ...}}

Patient simulator prompt. The patient simulator receives the full clinical profile (demographics, chief complaint, symptoms, medical history, observed findings) and persona instructions. Crucially, the patient...
[45]

Answer based on KNOWN OBSERVED FINDINGS and patient profile

[46]

If asked about something listed: answer faithfully (including denials)

[47]

If NOT listed: say “I’m not sure” or “I don’t know”. Do not invent symptoms
[48]

NEVER reveal your diagnosis directly

[49]

Keep responses concise (1–3 sentences).

PERSONA:
– Language: {CEFR level A/B/C}
– Personality: {plain|verbose|overanxious|distrustful}
– Memory: {high|low recall}
– Alertness: {normal|moderate daze|high daze}

Standalone doctor prompt. The standalone LLM doctor receives no external reasoning support. It conducts the full diagnostic interview and outputs a d...
[50]

{prediction_2} ...

REFERENCE DISEASE LIST:
– {disease_name_1}
– {disease_name_2}
...

Output ONLY a numbered list with the matched disease name (exactly as written in the reference list) or NO_MATCH.

LLM-generated KB prompts. The LLM-generated KB is constructed via two sequential prompts. The feature generation prompt asks the model to propose clinically plau...
[51]

KB Failure. We run an oracle test: all ground-truth features are supplied at confidence c = 1.0. We then measure the posterior gap between the oracle’s top-1 and top-2 diseases. If this gap falls below a threshold γ (γ = 0.80), the KB cannot reliably discriminate the disease pair.
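The oracle test for KB failure can be illustrated with a short Python sketch. It assumes the "posterior gap" is the difference in posterior probability between the top-1 and top-2 diseases; the text does not spell out the exact definition, so this reading is an assumption. The function names and the dictionary representation of the posterior are hypothetical.

```python
GAMMA = 0.80  # threshold γ quoted in the text


def posterior_gap(posterior: dict[str, float]) -> float:
    """Return P(top-1) - P(top-2) for a posterior over disease IDs.

    Assumption: 'gap' means the plain probability difference.
    """
    if len(posterior) < 2:
        raise ValueError("need at least two diseases to measure a gap")
    ranked = sorted(posterior.values(), reverse=True)
    return ranked[0] - ranked[1]


def is_kb_failure(oracle_posterior: dict[str, float], gamma: float = GAMMA) -> bool:
    """Flag a case as a KB failure when the oracle gap falls below gamma.

    `oracle_posterior` is the posterior obtained after supplying every
    ground-truth feature at confidence c = 1.0.
    """
    return posterior_gap(oracle_posterior) < gamma
```

For example, an oracle posterior of 0.60 versus 0.35 for the two leading diseases gives a gap of 0.25 and would be flagged, whereas 0.95 versus 0.03 would not.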
[52]

LLM Failure. The LLM pipeline (verbaliser + patient simulator + parser) injected incorrect evidence into the engine. Two subtypes:
• False Positive (FP): the engine asks about a feature absent from the patient’s ground-truth profile; the pipeline returns yes. If more than 2 such turns occur in a session, the case is flagged. The threshold reflects the empiri...
[53]

Inference Failure. The KB is adequate and the evidence pipeline introduced no detectable errors, yet the engine converged to the wrong diagnosis. Two subtypes:
• Close: the ground truth remains in the top-3 posterior at session end, but the question budget or EIG policy did not resolve the differential.
• Diverged: the ground truth is not in the top-3.
The ...
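The Close versus Diverged split for inference failures reduces to a top-3 membership check on the final posterior. A minimal Python sketch follows; the function name, the string labels, and the dictionary representation of the posterior are hypothetical, and the sketch assumes the case has already been established as an inference failure (KB adequate, no detected pipeline errors, wrong final diagnosis).

```python
def inference_failure_subtype(
    posterior: dict[str, float], ground_truth: str, k: int = 3
) -> str:
    """Label an inference failure as 'close' or 'diverged'.

    'close'    : ground truth is still among the top-k diseases at session end.
    'diverged' : ground truth has dropped out of the top-k.
    """
    # Rank disease IDs by posterior probability, highest first.
    top_k = sorted(posterior, key=posterior.get, reverse=True)[:k]
    return "close" if ground_truth in top_k else "diverged"


# Illustrative posterior over made-up disease IDs.
final_posterior = {"d_a": 0.50, "d_b": 0.30, "d_c": 0.15, "d_d": 0.05}
subtype = inference_failure_subtype(final_posterior, "d_c")
```

Here `d_c` ranks third, so the case counts as Close; a ground truth of `d_d` would count as Diverged.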
