MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

Akhil Arora; Alexandra Kulinkina; David Sasu; Fay Elhassan; Jiayi Ma; Julien Stalhandske; Lars Klein; Mary-Anne Hartley; Yena Chang; Yusuf Kesmen

arxiv: 2604.20022 · v3 · pith:EQMMM6FJnew · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 23:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords Bayesian inference · clinical decision support · large language models · modular AI · conversational systems · probabilistic reasoning · medical diagnosis · hybrid AI systems

The pith

Separating language parsing from Bayesian inference lets smaller LLMs outperform larger standalone models in clinical conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLMs currently used for medical advice mix next-token prediction with actual probabilistic decision making, which creates problems like missing uncertainty tracking and hard-to-audit reasoning. MoBayes addresses this by restricting the LLM to turning patient dialogue into structured observations while a separate Bayesian module updates probabilities, chooses questions to maximize information, and sets clear thresholds for answering or deferring. Tests across real and generated medical knowledge bases show the hybrid system beats frontier LLMs, including cases where cheap sensor models plus MoBayes beat bigger autonomous models at lower cost. The gains hold even when patients communicate in adversarial or varied styles. The core suggestion is that reliable conversational clinical support comes from splitting probabilistic reasoning away from language generation rather than relying on model scale alone.

Core claim

MoBayes is a modular framework in which an LLM functions only as a language interface that converts unstructured patient conversations into structured observations, while an independent Bayesian inference module maintains and updates posterior probabilities over diagnostic hypotheses, selects follow-up questions according to expected information gain, and applies calibrated thresholds to decide when to output a diagnosis, ask more, or abstain. This separation produces explicit, trackable posteriors and allows the statistical backend to be swapped for population-specific models without retraining the language component. Across both empirical and LLM-generated knowledge bases the resulting end

What carries the argument

The MoBayes modular split, in which the LLM serves solely as a parser of patient dialogue into structured observations while a Bayesian module performs all posterior updating, question selection, and decision-threshold control.
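The Bayesian side of this split can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the toy knowledge base (`PRIOR`, `LIKELIHOOD`), the conditional-independence (naive Bayes) likelihood, and the `update` and `decide` helpers are hypothetical. The LLM's only job in this picture would be to produce the `(feature, answer)` pairs.

```python
# Hypothetical toy knowledge base: prior P(d) and P(feature = yes | d).
PRIOR = {"flu": 0.5, "cold": 0.3, "covid": 0.2}
LIKELIHOOD = {
    "fever": {"flu": 0.9, "cold": 0.2, "covid": 0.8},
    "loss_of_smell": {"flu": 0.05, "cold": 0.05, "covid": 0.6},
}


def update(belief, feature, answer):
    """Apply Bayes' rule for one binary observation.

    Assumes features are conditionally independent given the disease.
    """
    unnorm = {}
    for disease, p in belief.items():
        p_yes = LIKELIHOOD[feature][disease]
        unnorm[disease] = p * (p_yes if answer else 1.0 - p_yes)
    z = sum(unnorm.values())
    return {disease: w / z for disease, w in unnorm.items()}


def decide(belief, tau, questions_left):
    """Diagnose if the top posterior reaches tau, else ask or abstain."""
    top, p_top = max(belief.items(), key=lambda kv: kv[1])
    if p_top >= tau:
        return ("diagnose", top)
    if questions_left > 0:
        return ("ask", None)
    return ("abstain", None)


# Structured observations as an LLM parser might emit them (assumed format).
belief = update(PRIOR, "fever", True)
belief = update(belief, "loss_of_smell", True)
print(belief)
print(decide(belief, tau=0.75, questions_left=0))
```

Because the posterior is an explicit dictionary, every intermediate belief can be logged and audited, and swapping `PRIOR` and `LIKELIHOOD` for another population's statistics would not touch the language component.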

If this is right

Explicit posterior tracking allows controllable abstention thresholds and auditable reasoning chains.
Population-specific statistical backends can be swapped without retraining the language model.
Cost advantages appear when pairing inexpensive LLMs with the Bayesian module rather than scaling the language model alone.
Performance advantages remain under adversarial patient communication styles.
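
The first implication, controllable abstention, can be made concrete with a short Python sketch. It assumes each finished session is summarized by its top posterior and whether the top diagnosis was correct; the `CASES` data and the `selective_metrics` helper are hypothetical and are not results from the paper. The three thresholds mirror the operating points named in the figure legend (τ→0, τ=0.50, τ=0.90).

```python
def selective_metrics(cases, tau):
    """Return (coverage, selective accuracy) at threshold tau.

    cases: list of (top_posterior, is_correct) pairs.
    The system answers only when top_posterior >= tau and abstains otherwise.
    """
    answered = [ok for p, ok in cases if p >= tau]
    coverage = len(answered) / len(cases)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy


# Hypothetical sessions, invented purely to exercise the function.
CASES = [
    (0.95, True), (0.92, True), (0.80, True),
    (0.70, False), (0.55, True), (0.40, False),
]

for tau in (0.0, 0.50, 0.90):
    print(tau, selective_metrics(CASES, tau))
```

Raising τ trades coverage for accuracy on the answered cases, which is the frontier a selective-accuracy versus coverage plot traces.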

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of language interface from probabilistic core could be tested in non-medical domains that require both natural dialogue and calibrated uncertainty, such as legal intake or financial advice.
Replacing the current Bayesian backend with more expressive graphical models or causal graphs might further improve calibration without changing the LLM component.
If parsing errors prove the main failure mode, targeted fine-tuning of the LLM solely on observation extraction could be a focused improvement path.

Load-bearing premise

The LLM can reliably convert unstructured patient conversations into structured observations that are accurate enough for the Bayesian module to produce correct posterior estimates and decisions.

What would settle it

An experiment showing that LLM parsing errors systematically shift posteriors or decision thresholds enough to produce measurably worse clinical accuracy or safety than a matched standalone LLM.

Figures

Figures reproduced from arXiv: 2604.20022 by Akhil Arora, Alexandra Kulinkina, David Sasu, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Lars Klein, Mary-Anne Hartley, Yena Chang, Yusuf Kesmen.

**Figure 1.** (a) Three paradigms for LLM-based diagnostic dialogue. Standalone: the LLM handles all reasoning, questioning, and diagnosis internally. LLM Bayesian: an external module computes EIG from LLM-derived posteriors, giving principled question selection but no grounded knowledge base. BMBE (ours): the LLM serves only as a sensor; all diagnostic reasoning is performed by a deterministic Bayesian engine grounded in an …

**Figure 2.** Overview of the BMBE architecture. The LLM layer handles only language: parsing …

**Figure 3.** Left: DHS vs. API cost per token. Right: DHS vs. estimated cost per patient. In both views, BMBE sensors (circles) achieve higher DHS than standalone doctors (squares) at 10–18× lower cost.

[Plot labels: axes Coverage (%) and Selective Accuracy (%); legend: default, GPT-5.4, Gemini 3.1 Pro, GPT-OSS-120B, Llama-4-Maverick, Qwen 3.6+, Kimi K2.5; operating points: Triage (τ→0), Balanced (τ=0.50), Safety-critical (τ=0.90); BMBE + G…]

**Figure 4.** Operating point control. The green curve shows the accuracy–coverage frontier of BMBE + …

**Figure 5.** Left: DDXPlus prior distribution sorted by prevalence. The long tail (max / min ≈ 200×) reflects real-world disease frequency; the dashed line shows the uniform baseline. Right: Distribution of positive evidence count per evaluation patient across KBs.

[Spilled body text: … two variants: one using GPT-5.4 and one using Gemini 3.1, enabling a cross-model comparison of zero-shot medical knowledge. Construction proceeds in two sta…]

**Figure 6.** Left: Distribution of LLM-elicited binary likelihoods P(yes | d) for both GPT and Gemini KBs; the strong left skew indicates that most disease–feature associations are weak. Right: CDF of per-pair KL divergence from uniform across all three KBs; DDXPlus (empirical) has the highest informativeness, while both LLM-KBs are comparable despite being synthetically generated.

[Spilled body text: … agreement on medical knowledge. GPT g…]

**Figure 7.** Left: Scatter plot of P(yes | d) for 45 shared features across 18 diseases (n=810 pairs); dashed line is perfect agreement. Right: Distribution of pairwise likelihood differences; the left-skewed distribution (mean = −0.055) confirms Gemini’s systematically higher assignments.

[Plot labels: axes Variance of P(yes ∣ d) across diseases and Number of features; legend: GPT-5.4, Gemini 2.5 …]

**Figure 8.** Feature discriminativeness (left: cross-disease variance; right: cross-disease range) for …

**Figure 9.** DDXPlus selective accuracy vs. coverage (…

**Figure 10.** DHS across patient personas. Shaded areas show degradation from the plain baseline …

**Figure 11.** Top-1 accuracy vs. KB size K. BMBE remains stable across a 4× increase in disease space; the standalone doctor is flat regardless of K …

**Figure 12.** Figure 12 illustrates the engine’s belief dynamics. The left panel tracks the posterior of the ground-truth disease across turns for a representative case: competing hypotheses rise and fall as evidence accumulates. The right panel aggregates entropy trajectories across all sessions, separating correct and incorrect diagnoses: correct cases exhibit steady entropy collapse, while incorrect cases plateau at elevated …

Original abstract

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next-token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected information gain, and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MoBayes keeps the LLM to observation extraction only and lets a Bayesian backend handle posteriors, question selection, and abstention, with reported gains over pure LLMs even in matched comparisons.

The main point is that this paper splits the work so the LLM only turns patient conversation into structured observations while a Bayesian module does the actual probabilistic updating, picks questions by expected information gain, and sets abstention thresholds. That separation is the concrete contribution, and it allows swapping in different population-specific backends without retraining the language model.

They show the combined system beating standalone frontier LLMs, including cheaper sensor models plus MoBayes outperforming larger autonomous ones, and the edge holds across diagnostic scenarios and adversarial patient styles. Code release helps with checking the details. The results draw from both empirical and LLM-generated knowledge bases, which gives some breadth to the tests. The architecture makes posterior tracking and decision thresholds explicit, which addresses a real limitation in end-to-end LLM clinical tools.

The soft spot is the parsing step. The claims rest on the LLM producing clean, accurate observations that feed directly into the Bayesian updates. The abstract asserts robustness under adversarial communication, but without reported parsing error rates, inter-annotator checks, or ablations that isolate observation quality from inference quality, it is hard to rule out that gains come from unusually clean inputs rather than the modular design itself. If parsing noise is material, the posteriors and question selection would be systematically off.

This is aimed at people working on clinical decision support or hybrid neuro-symbolic systems who need controllable uncertainty handling. Readers looking for practical ways to add auditability to conversational medical AI would get value from the architecture and the comparisons. The work is coherent on its own terms and has empirical backing plus code, so it deserves a serious referee rather than a desk reject. I would send it for review but flag the parsing validation and any sensitivity tests around observation errors as points for the referees to examine closely.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoBayes, a modular Bayesian dialogue framework for conversational clinical decision support. An LLM is used only as a language interface to parse unstructured patient conversations into structured observations (symptoms, history, etc.), while a separate Bayesian module performs posterior updating, selects follow-up questions via expected information gain, and applies calibrated thresholds for abstention or deferral. The central empirical claim is that this architecture outperforms standalone frontier LLMs (including matched model-family comparisons), remains robust under adversarial patient communication styles, and enables cost-effective pairings of small sensor models with MoBayes that exceed larger autonomous LLMs, across both empirical and LLM-generated knowledge bases. Code is released.

Significance. If the separation of parsing from probabilistic inference can be shown to be robust, the work would provide a concrete demonstration that explicit posterior tracking and controllable decision thresholds improve reliability and auditability over pure next-token prediction in clinical settings. The modular design also offers practical advantages in swapping population-specific statistical backends without retraining the language model. The public code release is a positive contribution to reproducibility.

major comments (2)

[Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.
[Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.

minor comments (2)

[§3 (Methodology)] Notation for the mapping from parsed observations to likelihood functions should be made explicit, ideally with a small example or pseudocode.
[Results figures] Figure captions and axis labels in the results section would benefit from greater clarity regarding which curves correspond to MoBayes versus baseline LLM configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify gaps in experimental reporting that limit the interpretability of our results. We address each major comment below and will incorporate the requested clarifications and additional analyses in the revised manuscript.

Referee: [Abstract and experimental evaluation section] The central performance claims rest on the assumption that the LLM parser converts free-form dialogue into structured observations without introducing systematic errors that bias the Bayesian posteriors or expected-information-gain calculations. No quantitative parsing error rates, inter-annotator agreement scores, or ablation experiments that isolate parsing noise from inference quality are reported, even under the adversarial communication styles highlighted in the abstract. This omission makes it impossible to determine whether reported gains are attributable to the modular architecture or to cleaner inputs.

Authors: We agree that explicit quantification of parser performance is necessary to substantiate the separation of language parsing from probabilistic reasoning. Although the current experiments demonstrate that performance advantages persist across both clean and adversarially perturbed dialogues, we did not report direct parsing accuracy metrics or controlled ablations. In the revision we will add (i) parsing error rates computed against ground-truth structured observations on a held-out dialogue set and (ii) an ablation comparing Bayesian inference quality with noisy versus oracle-clean parsed inputs. These additions will allow readers to isolate the contribution of the modular Bayesian component. revision: yes
Referee: [Abstract and §4 (Experiments)] The abstract asserts outperformance and robustness, yet the manuscript provides no details on baseline definitions, statistical significance tests, data exclusion criteria, or how adversarial patient styles were operationalized and measured. Without these, the degree to which the data support the claim that inexpensive sensor models + MoBayes exceed larger autonomous models cannot be verified.

Authors: We acknowledge that the experimental section lacked sufficient methodological detail. The baselines consist of standalone LLMs from the same model families used as sensor models in MoBayes; statistical comparisons were performed via paired t-tests over multiple random seeds, and adversarial styles were generated through targeted prompt modifications (e.g., vague, contradictory, or evasive patient responses). Data exclusion was limited to dialogues lacking any symptom or history information. In the revised §4 we will explicitly define all baselines, report p-values and confidence intervals, detail the prompt templates used for adversarial styles, and state the precise exclusion criteria. These clarifications will make the empirical support for the cost-effective sensor-model + MoBayes comparisons fully verifiable. revision: yes
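
The noisy-versus-oracle parsing ablation promised in the first response could take roughly the following shape. This is a simulation sketch under loud assumptions: the four-disease knowledge base `KB`, the symmetric flip-noise model of parser error, and the `diagnose` and `run` helpers are all hypothetical, and the numbers it prints are not results from the paper.

```python
import random

# Hypothetical knowledge base: P(feature = yes | disease).
KB = {
    "f0": {"d0": 0.9, "d1": 0.9, "d2": 0.1, "d3": 0.1},
    "f1": {"d0": 0.9, "d1": 0.1, "d2": 0.9, "d3": 0.1},
    "f2": {"d0": 0.1, "d1": 0.9, "d2": 0.9, "d3": 0.1},
    "f3": {"d0": 0.1, "d1": 0.1, "d2": 0.1, "d3": 0.9},
}
DISEASES = ["d0", "d1", "d2", "d3"]


def diagnose(observations):
    """Return the most probable disease under a uniform prior and naive Bayes."""
    belief = {d: 1.0 / len(DISEASES) for d in DISEASES}
    for feature, answer in observations.items():
        for d in belief:
            p_yes = KB[feature][d]
            belief[d] *= p_yes if answer else 1.0 - p_yes
    return max(belief, key=belief.get)  # normalization is not needed for argmax


def run(flip_prob, n_patients=2000, seed=0):
    """Top-1 accuracy when each parsed observation is flipped with flip_prob."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_patients):
        truth = rng.choice(DISEASES)
        # What the simulated patient actually has (sampled from the KB).
        true_obs = {f: rng.random() < KB[f][truth] for f in KB}
        # What the parser reports: flip_prob = 0.0 is the oracle parser.
        parsed = {f: (not v) if rng.random() < flip_prob else v
                  for f, v in true_obs.items()}
        correct += diagnose(parsed) == truth
    return correct / n_patients


for flip_prob in (0.0, 0.1, 0.2, 0.5):
    print(flip_prob, run(flip_prob))
```

Holding the inference engine fixed while sweeping `flip_prob` isolates how much diagnostic accuracy depends on observation quality, which is the separation the referee asks for.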

Circularity Check

0 steps flagged

No significant circularity; modular separation uses standard Bayesian updating on parsed inputs

full rationale

The paper describes a design in which an LLM parses unstructured dialogue into structured observations that then serve as inputs to a conventional Bayesian update with expected information gain for question selection. No equations, fitted parameters, or performance metrics are shown to be defined in terms of the evaluation outcomes themselves. The claimed advantages are presented as empirical results across knowledge bases and adversarial scenarios rather than as quantities that reduce by construction to the same data or to self-citations. The central separation claim rests on the architectural distinction and comparative experiments, not on any self-definitional loop, imported uniqueness result, or ansatz smuggled via prior work by the same authors. This is the most common honest finding for a modular framework paper whose core contribution is an engineering separation rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLM parsing produces observations of sufficient quality for Bayesian inference and on the existence of calibrated decision thresholds whose values are not fully specified in the abstract.

free parameters (1)

abstention and decision thresholds
Calibrated thresholds for stopping, deferring, or continuing the conversation are introduced to control selective decision-making; their specific values or fitting procedure are not detailed in the abstract.

axioms (1)

domain assumption: Structured observations extracted by the LLM are accurate and complete enough to serve as direct inputs to Bayesian posterior updating.
The framework treats LLM-parsed observations as reliable evidence for probabilistic inference without discussing parsing error rates or robustness checks.

invented entities (1)

MoBayes modular framework · no independent evidence
purpose: To enforce separation between language interface and probabilistic reasoning module.
The framework is newly introduced in this work; no independent prior evidence for its components in this exact configuration is referenced in the abstract.

pith-pipeline@v0.9.0 · 5790 in / 1547 out tokens · 72280 ms · 2026-05-20T23:48:56.767791+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

BMBE decomposes this into a language interface, an LLM that parses patient utterances and verbalises questions, and a Bayesian reasoning engine that maintains beliefs, selects questions, and renders diagnostic decisions.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

The EIG is the reduction in expected entropy: $\mathrm{EIG}(f) = H(b_t) - \sum_{v} P(X_f = v \mid E_t)\, H\big(b_t^{f=v}\big)$
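
The quoted expression can be sketched in Python for binary features. This assumes that the belief `b_t` is a dictionary over diseases, that the hypothetical `lik_yes` maps each disease to P(X_f = yes | d), and that entropy is measured in bits; none of these choices is confirmed by the quoted passage.

```python
from math import log2


def entropy(belief):
    """Shannon entropy H(b) in bits; zero-probability entries contribute 0."""
    return -sum(p * log2(p) for p in belief.values() if p > 0)


def posterior(belief, lik_yes, answer):
    """Belief after observing X_f = answer, by Bayes' rule."""
    unnorm = {d: p * (lik_yes[d] if answer else 1.0 - lik_yes[d])
              for d, p in belief.items()}
    z = sum(unnorm.values())
    return {d: w / z for d, w in unnorm.items()}


def eig(belief, lik_yes):
    """EIG(f) = H(b_t) - sum_v P(X_f = v | E_t) * H(b_t given f = v)."""
    p_yes = sum(belief[d] * lik_yes[d] for d in belief)
    expected_h = 0.0
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:  # skip answers that cannot occur
            expected_h += p_answer * entropy(posterior(belief, lik_yes, answer))
    return entropy(belief) - expected_h


# Hypothetical two-disease example.
belief = {"a": 0.5, "b": 0.5}
features = {
    "decisive": {"a": 1.0, "b": 0.0},       # answer identifies the disease
    "uninformative": {"a": 0.5, "b": 0.5},  # answer carries no information
}
best = max(features, key=lambda f: eig(belief, features[f]))
print(best, eig(belief, features[best]))
```

Question selection then reduces to an argmax of `eig` over the features not yet asked.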

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.