COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Chuanzhi Xu; Xiaozhen Zhong; Zihan Deng

arxiv: 2605.15016 · v1 · pith:S4HKZEL5new · submitted 2026-05-14 · 💻 cs.CL · cs.AI

COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Zihan Deng , Xiaozhen Zhong , Chuanzhi Xu This is my paper

Pith reviewed 2026-06-30 20:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords COTCAgentchain-of-thought completionlongitudinal EHRmedical reasoningtemporal statisticsdisease risk scoringclinical decision supportpreventive consultation

0 comments

The pith

COTCAgent lets language models reason over patient records across time by turning analysis plans into code and scoring risks against a symptom-trend-disease base.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COTCAgent to fix two specific problems in large language models that analyze longitudinal electronic health records: they invent clinical trends when quantitative details appear only in text, and they miss long-range patterns in irregular time series with few labels. The framework addresses this through three modules that convert plans into executable trend calculations, evaluate disease probabilities with a weighted knowledge base, and gather evidence via constrained questions. This separation of statistical work from language output removes the need for complex multi-modal data. A sympathetic reader would see the potential for more dependable preventive medical advice drawn from full patient histories rather than single visits.

Core claim

The central claim is that the Probabilistic Chain-of-Thought Completion Agent overcomes LLM limitations in longitudinal EHR reasoning by decoupling statistical computation, feature matching, and language generation; its Temporal-Statistics Adapter produces standardized trend outputs from code, its Chain-of-Thought Completion layer scores disease risk via a symptom-trend-disease knowledge base with weighted scoring, and its bounded completion module enforces rigorous evidence gathering through iterative inquiries, yielding 90.47 percent Top-1 accuracy on a self-built dataset and 70.41 percent on HealthBench while using lower computational overhead than prior medical agents.

What carries the argument

The Chain-of-Thought Completion (COTC) layer that applies weighted scoring from a symptom-trend-disease knowledge base to evaluate disease risk after the Temporal-Statistics Adapter converts plans into executable trend code.

If this is right

Medical agents gain higher accuracy than existing systems on both the self-built dataset and HealthBench by avoiding trend hallucinations.
Longitudinal records with non-uniform timing become analyzable without requiring the base model to perform fine-grained statistical reasoning internally.
Preventive consultation becomes feasible through structured, iterative scoring that maintains evidence constraints across time steps.
Analysis runs with lower overhead because the framework avoids complex multi-modal inputs and relies on code execution for trends.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular split between code-based statistics and language-based inference could be adapted to other time-series domains that currently suffer from hallucinated metrics.
If the weighted scoring proves stable, the approach would reduce the need for extensive fine-tuning of large models on medical text alone.
Deployment in live clinical systems would require checking whether the knowledge base needs periodic updates to reflect new disease patterns.

Load-bearing premise

The symptom-trend-disease knowledge base together with its weighted scoring produces unbiased risk estimates that generalize beyond the self-built dataset used for development.

What would settle it

Testing COTCAgent on a new longitudinal EHR collection from a different clinical source where its Top-1 accuracy falls below that of standard large language models without the framework would show the risk estimates do not hold.

Figures

Figures reproduced from arXiv: 2605.15016 by Chuanzhi Xu, Xiaozhen Zhong, Zihan Deng.

**Figure 2.** Figure 2: COTC module wiring: TSA narratives and structured cues feed the KB matcher, producing [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Conversational suites under matched decoding: MedQA and HealthBench accuracy [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Representation probes for model variants (coherence, temporal smoothness, semantic [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of staged tool use + structured completion ( [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗

**Figure 6.** Figure 6: Radar view of multiplicity in the Symptom–Trend–Disease KB (symptoms, trends, and [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗

read the original abstract

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

COTCAgent's 90% on the self-built set versus 70% on HealthBench likely traces to the symptom-trend-disease KB and weights being tied to the development data.

read the letter

The main thing to know is that this paper builds an agent with three pieces: a Temporal-Statistics Adapter that turns plans into code for trend stats, a COTC layer that scores disease risk against a symptom-trend-disease knowledge base, and a bounded completion module for structured follow-up questions. The reported numbers are 90.47% top-1 on the self-built longitudinal EHR set and 70.41% on HealthBench using Baichuan-M2.

What is actually new is the specific assembly of code execution for quantitative trends with a weighted KB for risk, plus the attempt to keep statistical work outside the LLM to cut hallucinations. The code release on GitHub is a plus and lets others inspect the implementation.

The soft spot is the KB itself. The weighted scoring is load-bearing for the disease risk claims, yet the description gives no independent construction details or ablations that would show the weights were set without reference to the self-built dataset. The performance drop on the external benchmark matches exactly what would happen under overfitting, and the other modules feed into the same scoring step so they do not resolve it.

This is for people working on medical decision-support agents who want a concrete example of mixing code-based stats with KB scoring. A reader already deep in that area might pick up a module or two to try, but the evidence for reliable generalization is thin.

I would send it for peer review only if the authors add clear documentation on KB construction, weight derivation, and at least one ablation that isolates the scoring component. Without those the central claim does not hold up well enough for serious referee time.

Referee Report

2 major / 1 minor

Summary. The paper introduces COTCAgent, a hierarchical reasoning framework for longitudinal EHR analysis consisting of the Temporal-Statistics Adapter (TSA) to convert analytical plans into executable code for trend output, the Chain-of-Thought Completion (COTC) layer that leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, and a bounded completion module for structured evidence via standardized inquiries and iterative constraints. It claims that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on a self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream LLMs, while decoupling statistical computation from language generation to reduce hallucinations and computational overhead.

Significance. If the weighted scoring mechanism proves independent of the development data and the performance gains are attributable to the proposed modules rather than overfitting, the work could advance reliable temporal reasoning in medical LLMs by providing a structured way to integrate probabilistic risk evaluation without complex multi-modal inputs. The public code release at the provided GitHub link is a positive factor supporting potential reproducibility.

major comments (2)

[COTC layer] COTC layer: The symptom-trend-disease knowledge base and its weighted scoring mechanism are presented as enabling probabilistic completion and disease risk evaluation, but the manuscript provides no details on how the weights or KB entries are constructed or validated. If these weights are derived from or tuned against the self-built longitudinal EHR dataset (as implied by the 'self-built dataset used for development' phrasing), the reported accuracy gap (90.47% vs. 70.41%) is consistent with overfitting rather than robust generalization; this directly undermines the central claim that the framework enables reliable clinical reasoning beyond the development data.
[Experimental results] Experimental results: The reported Top-1 accuracies and outperformance claims lack supporting details such as dataset statistics, baseline implementations, ablation studies on the TSA/COTC/bounded completion components, or error analysis. Without these, it is impossible to assess whether the accuracies support the claim that the framework's decoupling of statistical computation and feature matching drives the gains.

minor comments (1)

The title emphasizes 'Preventive Consultation' but the abstract and described modules focus on diagnostic risk evaluation; clarify the distinction or scope if preventive aspects are intended.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below and will make the necessary revisions to strengthen the manuscript.

read point-by-point responses

Referee: [COTC layer] COTC layer: The symptom-trend-disease knowledge base and its weighted scoring mechanism are presented as enabling probabilistic completion and disease risk evaluation, but the manuscript provides no details on how the weights or KB entries are constructed or validated. If these weights are derived from or tuned against the self-built longitudinal EHR dataset (as implied by the 'self-built dataset used for development' phrasing), the reported accuracy gap (90.47% vs. 70.41%) is consistent with overfitting rather than robust generalization; this directly undermines the central claim that the framework enables reliable clinical reasoning beyond the development data.

Authors: We agree that the current manuscript provides insufficient detail on KB construction and weight assignment, which is a valid concern. The manuscript does not describe these processes. In revision we will add a dedicated subsection explaining that KB entries are derived from publicly available medical literature and clinical guidelines (e.g., established symptom-disease associations), with weights assigned according to published probabilistic risk factors rather than optimized on any evaluation data. We will also state explicitly that the self-built dataset is used exclusively for final evaluation and was not involved in KB or weight development. The maintained performance advantage on the independent HealthBench dataset will be highlighted as evidence against overfitting. revision: yes
Referee: [Experimental results] Experimental results: The reported Top-1 accuracies and outperformance claims lack supporting details such as dataset statistics, baseline implementations, ablation studies on the TSA/COTC/bounded completion components, or error analysis. Without these, it is impossible to assess whether the accuracies support the claim that the framework's decoupling of statistical computation and feature matching drives the gains.

Authors: We acknowledge that the experimental section is missing these supporting elements. The manuscript currently reports only aggregate accuracies without dataset statistics, implementation details for baselines, component ablations, or error analysis. In the revision we will add: (1) full dataset statistics including record counts, time spans, and patient demographics for both the self-built set and HealthBench; (2) descriptions of how each baseline was implemented and prompted; (3) ablation results quantifying the contribution of TSA, COTC, and bounded completion; and (4) a categorized error analysis. These additions will directly support the claims regarding the benefits of decoupling statistical computation from language generation. revision: yes

Circularity Check

1 steps flagged

COTC weighted scoring and symptom-trend-disease KB appear derived from self-built dataset, rendering 90.47% accuracy a fitted result rather than independent prediction

specific steps

fitted input called prediction [COTC layer (abstract and method description)]
"The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. ... Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset"

The KB and weighted scoring are presented as the mechanism for risk evaluation and are tied to the self-built longitudinal EHRs used for development. When the same dataset supplies both the weights/parameters and the reported accuracy, the 90.47% figure is statistically forced rather than an independent test of the framework's reasoning; the HealthBench result does not rescue the self-built claim.

full rationale

The paper's central performance claim rests on the COTC layer's probabilistic completion via a symptom-trend-disease knowledge base and its weighted scoring. The abstract explicitly ties this KB to the self-built dataset used for development, with no description of independent construction, external validation, or ablation showing the weights are parameter-free or derived outside the evaluation set. This matches the fitted-input-called-prediction pattern: the scoring mechanism evaluates disease risk on data from which its parameters were likely obtained, so the Top-1 accuracy on that set reduces to the input by construction. The gap to HealthBench (70.41%) is consistent with overfitting rather than robust generalization. TSA and bounded-completion modules feed into the same scoring step without breaking the dependency. No equations or self-citations are provided that would establish the KB/weights as externally fixed.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Assessment uses only the abstract; full parameter lists, knowledge-base construction details, and dataset provenance are unavailable.

free parameters (1)

weights in symptom-trend-disease scoring
Weighted scoring for disease risk evaluation is invoked in the COTC layer and is expected to be tuned to data.

axioms (1)

domain assumption The symptom-trend-disease knowledge base provides accurate and unbiased mappings for risk evaluation
Directly invoked when the COTC layer uses the base for weighted scoring.

invented entities (2)

Temporal-Statistics Adapter (TSA) no independent evidence
purpose: Convert analytical plans into executable code for standardized trend output
New module introduced to decouple statistical computation from language generation.
Bounded completion module no independent evidence
purpose: Acquire structured evidence through standardized inquiries and iterative scoring constraints
New module introduced to enforce rigorous reasoning.

pith-pipeline@v0.9.1-grok · 5818 in / 1433 out tokens · 36720 ms · 2026-06-30T20:35:13.905257+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 14 canonical work pages · 2 internal anchors

[1]

S. Zhou, Z. Xu, M. Zhang, C. Xu, Y . Guo, Z. Zhan, and R. Zhang. Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

2025
[2]

Context clues: Evaluating long context models for clinical prediction tasks on EHRs.arXiv preprint arXiv:2412.16178, 2024

Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Ré, Sanmi Koyejo, and Nigam H Shah. Context clues: Evaluating long context models for clinical prediction tasks on EHRs.arXiv preprint arXiv:2412.16178, 2024

work page arXiv 2024
[3]

X. Zhou, J. Zhou, C. Wang, Q. Xie, K. Ding, C. Mao, and Y . Luo. Ph-llm: Public health large language models for infoveillance.medRxiv, 2025. Preprint

2025
[4]

N. Chan, F. Parker, W. Bennett, T. Wu, M. Y . Jia, J. Fackler, and K. Ghobadi. Medtsllm: Leveraging llms for multimodal medical time series analysis.arXiv preprint arXiv:2408.07773, 2024

work page arXiv 2024
[5]

M. Feli, I. Azimi, P. Liljeberg, and A. M. Rahmani. An llm-powered agent for physiological data analysis: A case study on ppg-based heart rate estimation.arXiv preprint arXiv:2502.12836, 2025

work page arXiv 2025
[6]

Shaik, X

T. Shaik, X. Tao, L. Li, H. Xie, H. N. Dai, F. Zhao, and J. Yong. Adaptive multi-agent deep reinforcement learning for timely healthcare interventions.arXiv preprint arXiv:2309.10980, 2023

work page arXiv 2023
[7]

H. Nori, M. Daswani, C. Kelly, S. Lundberg, M. T. Ribeiro, M. Wilson, and E. Horvitz. Sequential diagnosis with language models.arXiv preprint arXiv:2506.22405, 2025

work page arXiv 2025
[8]

Milad Mirbabaie, Stefan Stieglitz, and Nicholas RJ Frick. Artificial intelligence in disease diagnostics: A critical review and classification on the current state of research guiding future direction.Health and Technology, 11(4):693–731, 2021

2021
[9]

High-performance medicine: the convergence of human and artificial intelligence

Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019

2019
[10]

Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

2025
[11]

Capabilities of gemini models in medicine.Google Research, 2023

Daniel McDuff, Mohammad Norouzi, Scott Lundberg, Jianfeng Gao, Emre Kiciman, Saurabh Gombar, Karan Patel, Brian Lansdell, Chun Hwei Teo, Chunyuan Liao, et al. Capabilities of gemini models in medicine.Google Research, 2023. Preprint

2023
[12]

A guide to deep learning in healthcare.Nature Medicine, 25(1):24–29, 2019

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, V olodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare.Nature Medicine, 25(1):24–29, 2019

2019
[13]

arXiv preprint arXiv:2503.17407 , year=

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling.arXiv preprint arXiv:2503.17407, 2025

work page arXiv 2025
[14]

Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2402.11588, 2024

Xiao Wang, Yifan Li, Ming Zhang, Yuxiao Zhang, Yuan Liu, Xiang Liu, and Rui Zhang. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2402.11588, 2024

work page arXiv 2024
[15]

Agentmath: Empowering mathematical reasoning for large language models via tool-augmented agent,

Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, and Di Wang. Agentmath: Empowering mathematical reasoning for large language models via tool-augmented agent.arXiv preprint arXiv:2512.20745, 2025

work page arXiv 2025
[16]

ChatDev: Communicative Agents for Software Development

Chen Qian, Xin Cong, Cheng Yang, Weilin Chen, Juyoung Su, Jiayi Zhang, Yuxiao Zhang, Yuan Liu, and Yuan Li. Communicative agents for software development.arXiv preprint arXiv:2307.07924, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Improving llm mathematical reasoning capabilities using external tools, 2025

Jack Albright and Sheden Andemicael. Improving llm mathematical reasoning capabilities using external tools, 2025. 10

2025
[18]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yuxuan Liu, Tianyu Han, Jie Han, Yuan Li, Hao Zhang, Zhengyan Liu, Jiawei Liu, Xincan Liu, Zihan Liu, Xiao Liu, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. InInternational Conference on Learning Representations, 2024

2024
[19]

MIT Press, 2009

Daphne Koller and Nir Friedman.Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009

2009
[20]

A survey on unifying large language models and knowledge graphs for biomedicine and healthcare

Ran Xu, Patrick Jiang, Linhao Luo, Cao Xiao, Adam Cross, Shirui Pan, Jimeng Sun, and Carl Yang. A survey on unifying large language models and knowledge graphs for biomedicine and healthcare. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6195–6205, 2025

2025
[21]

Biobridge: Bridging biomedical foundation models via knowledge graphs.arXiv preprint arXiv:2310.03320, 2023

Zifeng Wang, Zichen Wang, Balasubramaniam Srinivasan, Vassilis N Ioannidis, Huzefa Rang- wala, and Rishita Anubhai. Biobridge: Bridging biomedical foundation models via knowledge graphs.arXiv preprint arXiv:2310.03320, 2023

work page arXiv 2023
[22]

Álvaro García-Barragán, Ahmad Sakor, Maria-Esther Vidal, Ernestina Menasalvas, Juan Cristo- bal Sanchez Gonzalez, Mariano Provencio, and Víctor Robles. Nssc: a neuro-symbolic ai system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes.Medical & Biological Engineering & Computing, 63(3):749–772, 2025

2025
[23]

Integrating knowledge graphs with symbolic ai: The path to interpretable hybrid ai systems in medicine.Journal of Web Semantics, 84:100856, 2025

Maria-Esther Vidal, Yashrajsinh Chudasama, Hao Huang, Disha Purohit, and Maria Torrente. Integrating knowledge graphs with symbolic ai: The path to interpretable hybrid ai systems in medicine.Journal of Web Semantics, 84:100856, 2025

2025
[24]

Neurosymbolic ai for reasoning on biomedical knowledge graphs.arXiv preprint arXiv:2307.08411, 2023

Lauren Nicole DeLong, Ramon Fernández Mir, Zonglin Ji, Fiona Niamh Coulter Smith, and Jacques D Fleuriot. Neurosymbolic ai for reasoning on biomedical knowledge graphs.arXiv preprint arXiv:2307.08411, 2023

work page arXiv 2023
[25]

Brittleness and promise: Knowledge graph based reward modeling for diagnostic reasoning.arXiv preprint arXiv:2509.18316, 2025

Saksham Khatwani, He Cheng, Majid Afshar, Dmitriy Dligach, and Yanjun Gao. Brittleness and promise: Knowledge graph based reward modeling for diagnostic reasoning.arXiv preprint arXiv:2509.18316, 2025

work page arXiv 2025
[26]

Reasoning-enhanced healthcare predictions with knowledge graph community retrieval.arXiv preprint arXiv:2410.04585, 2024

Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval.arXiv preprint arXiv:2410.04585, 2024

work page arXiv 2024
[27]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021
[28]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Learning to ask like a physician: a discharge summary clinical questions (discq) dataset.bioRxiv preprint, 2025

Eric Lehman. Learning to ask like a physician: a discharge summary clinical questions (discq) dataset.bioRxiv preprint, 2025. Dataset/draft as cited DiSCQ

2025
[30]

Time-mmd: Multi-domain multimodal dataset for time series analysis.Advances in Neural Information Processing Systems, 37:77888–77933, 2024

Haoxin Liu, Shangqing Xu, Zhiyuan Zhao, Lingkai Kong, Harshavardhan Prabhakar Kamarthi, Aditya Sasanur, Megha Sharma, Jiaming Cui, Qingsong Wen, Chao Zhang, et al. Time-mmd: Multi-domain multimodal dataset for time series analysis.Advances in Neural Information Processing Systems, 37:77888–77933, 2024

2024
[31]

G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen. Timecap: Learning to contextualize, augment, and predict time series events with large language model agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18082–18090, April 2025

2025
[32]

En- hancing healthcare decision support through explainable ai models for risk prediction.Decision Support Systems, 181:114228, 2024

Shuai Niu, Qing Yin, Jing Ma, Yunya Song, Yida Xu, Liang Bai, Wei Pan, and Xian Yang. En- hancing healthcare decision support through explainable ai models for risk prediction.Decision Support Systems, 181:114228, 2024

2024
[33]

Gaussian processes for machine learning.International journal of neural systems, 14(02):69–106, 2004

Matthias Seeger. Gaussian processes for machine learning.International journal of neural systems, 14(02):69–106, 2004. 11

2004
[34]

Springer, 1997

Mike West and Jeff Harrison.Bayesian forecasting and dynamic models. Springer, 1997

1997
[35]

Recursive regularization for inferring gene networks from time-course gene expression profiles.BMC systems biology, 3(1):41, 2009

Teppei Shimamura, Seiya Imoto, Rui Yamaguchi, André Fujita, Masao Nagasaki, and Satoru Miyano. Recursive regularization for inferring gene networks from time-course gene expression profiles.BMC systems biology, 3(1):41, 2009

2009
[36]

Time-dependent covariates in the cox proportional-hazards regression model.Annual review of public health, 20(1):145–157, 1999

Lloyd D Fisher and Danyu Y Lin. Time-dependent covariates in the cox proportional-hazards regression model.Annual review of public health, 20(1):145–157, 1999

1999
[37]

probabilities

Ingrid Daubechies.Ten lectures on wavelets. SIAM, 1992. 12 Table 5: TSA toolbox by category (candidate methods; router picks a sparse subset per query following Appendix E). Category Representative methods Typical intents Statistical Test- ing Paired t-test; Repeated Measures ANOV A; Wilcoxon test; Bayesian change point detection Time point comparison; Va...

1992
[38]

Extract trends for AFP, Hematemesis, Headache: - Severity slope, sudden changes, abnormal duration
[39]

Critical

Output: [Indicator, Time Window, Trend, Inflection Points]. Constraints: - "Critical" AFP = possible liver disease progression. - Single "None" for Headache $\approx$ resolved. TSA narrative output. === Normal Alpha-fetoprotein === Window: 2025-06-18 to 2027-01-09 Trend: Upward (Progressive) | Slope: 0.001892 Inflections: [{’Time’: ’2026-08-12’, ’Prev’: ’...

2025
[40]

TSA trends: AFP up, Hematemesis progress, Headache worse
[41]

Knowledge Base: - AFP Critical + Hematemesis Critical $\rightarrow$Liver Cirrhosis (0.85); - Headache worse + Blurred Vision $\rightarrow$Hypertensive Encephalopathy (0.6). Task:
[42]

Calculate probabilities (use prior diagnoses: Liver Cirrhosis, Gout)
[43]

Rank top 5 diseases + explain logic
[44]

Constraints: - High confidence requires corroborating trend matches

List gaps for consultation. Constraints: - High confidence requires corroborating trend matches. Initial ranked hypotheses (illustrative). === Initial Risk Ranking (Top 5) ===
[45]

Severe Liver Cirrhosis Exacerbation - Logic: AFP up (Critical) + Hematemesis up (Critical) + prior - Prob: 0.85
[46]

Hypertensive Encephalopathy - Logic: Headache worse + patient’s blurred vision - Prob: 0.60
[47]

Acute Gastric Hemorrhage - Logic: Hematemesis (suspected dark red) + unconfirmed alcohol - Prob: 0.45
[48]

Gouty Arthritis Flare-up - Logic: Prior Gout + no recent joint pain (gap) - Prob: 0.20
[49]

Migraine - Logic: Past "Extreme" headache + no aura (gap) - Prob: 0.15 === Gaps to Resolve ===
[50]

AFP spike cause (mid-2026): Abdominal pain? Alcohol?

2026
[51]

Headache: High BP? Painkillers? 19
[52]

Did you have abdominal pain or drink alcohol in mid-2026?

Hematemesis: Frequency? Color? Black stools? C.2.3 Gap-to-question mapping These questions correspond to unresolved slots enumerated by TOPGAPSin Alg. 1. Questions tied to structured gaps. Q1: Q1 (AFP gap): "Did you have abdominal pain or drink alcohol in mid-2026?" Q2: Q2 (Headache gap): "Any high BP, blurred vision, or painkillers lately?" Q3: Q3 (Hemat...

2026
[53]

Severe Liver Cirrhosis Exacerbation - New: + Alcohol (mid-2026) + abdominal pain (Jul 2026) - Prob: 0.85 $\rightarrow$0.92

2026
[54]

Hypertensive Encephalopathy - New: + Confirmed blurred vision + poor sleep (BP risk) - Prob: 0.60 $\rightarrow$0.75
[55]

Acute Gastric Hemorrhage - New: + Dark red hematemesis (monthly) + alcohol (gastric irritant) - Prob: 0.45 $\rightarrow$0.55
[56]

Gouty Arthritis Flare-up - New: - No joint pain reported - Prob: 0.20 $\rightarrow$0.10
[57]

=== Final Risk Prediction (patient_0077) ===

Migraine - New: - No aura + blurred vision = BP link - Prob: 0.15 $\rightarrow$0.05 C.2.4 Narrative wrap-up Clinician-facing summary. === Final Risk Prediction (patient_0077) ===
[58]

High Risk: Severe Liver Cirrhosis Exacerbation - Basis: AFP up (Critical, Aug 2026); hematemesis (dark red, monthly); alcohol trigger + abdominal pain (Jul 2026); prior diagnosis

2026
[59]

Medium Risk: Hypertensive Encephalopathy - Basis: Headache worse; blurred vision; poor sleep (BP risk); no painkillers (rules out drug cause)
[60]

abrupt"/

Low Risk: Acute Gastric Hemorrhage - Basis: Dark red hematemesis (monthly); alcohol irritation; no black stools (no massive hemorrhage). Step-by-step audit trail. === Complete Chain-of-Thought === Step 1: TSA $\rightarrow$Extract 3 trends: 20 Table 7: KB governance checklist (abbreviated). Stage Protocol detail Source ingestion Public medical portals + gu...

2026
[61]

Disease: p=0.__ - one-sentence justification
[62]

Fairness checklist (aligned with §4 and the decoding defaults above)

Disease: p=0.__ - ... Fairness checklist (aligned with §4 and the decoding defaults above)
[63]

Identical templates per backbone; no ad-hoc chain-of-thought hints for competitors
[64]

Contexts truncated/padded to the same token budget before scoring
[65]

Parser extracts the first three probability lines; malformed outputs count as errors equally for every method
[66]

Random seeds, API endpoints, and batching policies are logged alongside the KB hashes in Appendix D. G Appendix G: Full conversational-suite table Figure 3 plots MedQA and HealthBench accuracy (mean ±std over five rerolls) across the five backbones for each agent recipe; Google and DirPred rows are omitted from the bar layout but appear numerically in Tab...

[1] [1]

S. Zhou, Z. Xu, M. Zhang, C. Xu, Y . Guo, Z. Zhan, and R. Zhang. Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

2025

[2] [2]

Context clues: Evaluating long context models for clinical prediction tasks on EHRs.arXiv preprint arXiv:2412.16178, 2024

Michael Wornow, Suhana Bedi, Miguel Angel Fuentes Hernandez, Ethan Steinberg, Jason Alan Fries, Christopher Ré, Sanmi Koyejo, and Nigam H Shah. Context clues: Evaluating long context models for clinical prediction tasks on EHRs.arXiv preprint arXiv:2412.16178, 2024

work page arXiv 2024

[3] [3]

X. Zhou, J. Zhou, C. Wang, Q. Xie, K. Ding, C. Mao, and Y . Luo. Ph-llm: Public health large language models for infoveillance.medRxiv, 2025. Preprint

2025

[4] [4]

N. Chan, F. Parker, W. Bennett, T. Wu, M. Y . Jia, J. Fackler, and K. Ghobadi. Medtsllm: Leveraging llms for multimodal medical time series analysis.arXiv preprint arXiv:2408.07773, 2024

work page arXiv 2024

[5] [5]

M. Feli, I. Azimi, P. Liljeberg, and A. M. Rahmani. An llm-powered agent for physiological data analysis: A case study on ppg-based heart rate estimation.arXiv preprint arXiv:2502.12836, 2025

work page arXiv 2025

[6] [6]

Shaik, X

T. Shaik, X. Tao, L. Li, H. Xie, H. N. Dai, F. Zhao, and J. Yong. Adaptive multi-agent deep reinforcement learning for timely healthcare interventions.arXiv preprint arXiv:2309.10980, 2023

work page arXiv 2023

[7] [7]

H. Nori, M. Daswani, C. Kelly, S. Lundberg, M. T. Ribeiro, M. Wilson, and E. Horvitz. Sequential diagnosis with language models.arXiv preprint arXiv:2506.22405, 2025

work page arXiv 2025

[8] [8]

Milad Mirbabaie, Stefan Stieglitz, and Nicholas RJ Frick. Artificial intelligence in disease diagnostics: A critical review and classification on the current state of research guiding future direction.Health and Technology, 11(4):693–731, 2021

2021

[9] [9]

High-performance medicine: the convergence of human and artificial intelligence

Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, 2019

2019

[10] [10]

Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models.Nature Medicine, 31(3):943–950, 2025

2025

[11] [11]

Capabilities of gemini models in medicine.Google Research, 2023

Daniel McDuff, Mohammad Norouzi, Scott Lundberg, Jianfeng Gao, Emre Kiciman, Saurabh Gombar, Karan Patel, Brian Lansdell, Chun Hwei Teo, Chunyuan Liao, et al. Capabilities of gemini models in medicine.Google Research, 2023. Preprint

2023

[12] [12]

A guide to deep learning in healthcare.Nature Medicine, 25(1):24–29, 2019

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, V olodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare.Nature Medicine, 25(1):24–29, 2019

2019

[13] [13]

arXiv preprint arXiv:2503.17407 , year=

Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al. A comprehensive survey on long context language modeling.arXiv preprint arXiv:2503.17407, 2025

work page arXiv 2025

[14] [14]

Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2402.11588, 2024

Xiao Wang, Yifan Li, Ming Zhang, Yuxiao Zhang, Yuan Liu, Xiang Liu, and Rui Zhang. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2402.11588, 2024

work page arXiv 2024

[15] [15]

Agentmath: Empowering mathematical reasoning for large language models via tool-augmented agent,

Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, and Di Wang. Agentmath: Empowering mathematical reasoning for large language models via tool-augmented agent.arXiv preprint arXiv:2512.20745, 2025

work page arXiv 2025

[16] [16]

ChatDev: Communicative Agents for Software Development

Chen Qian, Xin Cong, Cheng Yang, Weilin Chen, Juyoung Su, Jiayi Zhang, Yuxiao Zhang, Yuan Liu, and Yuan Li. Communicative agents for software development.arXiv preprint arXiv:2307.07924, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Improving llm mathematical reasoning capabilities using external tools, 2025

Jack Albright and Sheden Andemicael. Improving llm mathematical reasoning capabilities using external tools, 2025. 10

2025

[18] [18]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yuxuan Liu, Tianyu Han, Jie Han, Yuan Li, Hao Zhang, Zhengyan Liu, Jiawei Liu, Xincan Liu, Zihan Liu, Xiao Liu, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. InInternational Conference on Learning Representations, 2024

2024

[19] [19]

MIT Press, 2009

Daphne Koller and Nir Friedman.Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009

2009

[20] [20]

A survey on unifying large language models and knowledge graphs for biomedicine and healthcare

Ran Xu, Patrick Jiang, Linhao Luo, Cao Xiao, Adam Cross, Shirui Pan, Jimeng Sun, and Carl Yang. A survey on unifying large language models and knowledge graphs for biomedicine and healthcare. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6195–6205, 2025

2025

[21] [21]

Biobridge: Bridging biomedical foundation models via knowledge graphs.arXiv preprint arXiv:2310.03320, 2023

Zifeng Wang, Zichen Wang, Balasubramaniam Srinivasan, Vassilis N Ioannidis, Huzefa Rang- wala, and Rishita Anubhai. Biobridge: Bridging biomedical foundation models via knowledge graphs.arXiv preprint arXiv:2310.03320, 2023

work page arXiv 2023

[22] [22]

Álvaro García-Barragán, Ahmad Sakor, Maria-Esther Vidal, Ernestina Menasalvas, Juan Cristo- bal Sanchez Gonzalez, Mariano Provencio, and Víctor Robles. Nssc: a neuro-symbolic ai system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes.Medical & Biological Engineering & Computing, 63(3):749–772, 2025

2025

[23] [23]

Integrating knowledge graphs with symbolic ai: The path to interpretable hybrid ai systems in medicine.Journal of Web Semantics, 84:100856, 2025

Maria-Esther Vidal, Yashrajsinh Chudasama, Hao Huang, Disha Purohit, and Maria Torrente. Integrating knowledge graphs with symbolic ai: The path to interpretable hybrid ai systems in medicine.Journal of Web Semantics, 84:100856, 2025

2025

[24] [24]

Neurosymbolic ai for reasoning on biomedical knowledge graphs.arXiv preprint arXiv:2307.08411, 2023

Lauren Nicole DeLong, Ramon Fernández Mir, Zonglin Ji, Fiona Niamh Coulter Smith, and Jacques D Fleuriot. Neurosymbolic ai for reasoning on biomedical knowledge graphs.arXiv preprint arXiv:2307.08411, 2023

work page arXiv 2023

[25] [25]

Brittleness and promise: Knowledge graph based reward modeling for diagnostic reasoning.arXiv preprint arXiv:2509.18316, 2025

Saksham Khatwani, He Cheng, Majid Afshar, Dmitriy Dligach, and Yanjun Gao. Brittleness and promise: Knowledge graph based reward modeling for diagnostic reasoning.arXiv preprint arXiv:2509.18316, 2025

work page arXiv 2025

[26] [26]

Reasoning-enhanced healthcare predictions with knowledge graph community retrieval.arXiv preprint arXiv:2410.04585, 2024

Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval.arXiv preprint arXiv:2410.04585, 2024

work page arXiv 2024

[27] [27]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021

[28] [28]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Learning to ask like a physician: a discharge summary clinical questions (discq) dataset.bioRxiv preprint, 2025

Eric Lehman. Learning to ask like a physician: a discharge summary clinical questions (discq) dataset.bioRxiv preprint, 2025. Dataset/draft as cited DiSCQ

2025

[30] [30]

Time-mmd: Multi-domain multimodal dataset for time series analysis.Advances in Neural Information Processing Systems, 37:77888–77933, 2024

Haoxin Liu, Shangqing Xu, Zhiyuan Zhao, Lingkai Kong, Harshavardhan Prabhakar Kamarthi, Aditya Sasanur, Megha Sharma, Jiaming Cui, Qingsong Wen, Chao Zhang, et al. Time-mmd: Multi-domain multimodal dataset for time series analysis.Advances in Neural Information Processing Systems, 37:77888–77933, 2024

2024

[31] [31]

G. Lee, W. Yu, K. Shin, W. Cheng, and H. Chen. Timecap: Learning to contextualize, augment, and predict time series events with large language model agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18082–18090, April 2025

2025

[32] [32]

En- hancing healthcare decision support through explainable ai models for risk prediction.Decision Support Systems, 181:114228, 2024

Shuai Niu, Qing Yin, Jing Ma, Yunya Song, Yida Xu, Liang Bai, Wei Pan, and Xian Yang. En- hancing healthcare decision support through explainable ai models for risk prediction.Decision Support Systems, 181:114228, 2024

2024

[33] [33]

Gaussian processes for machine learning.International journal of neural systems, 14(02):69–106, 2004

Matthias Seeger. Gaussian processes for machine learning.International journal of neural systems, 14(02):69–106, 2004. 11

2004

[34] [34]

Springer, 1997

Mike West and Jeff Harrison.Bayesian forecasting and dynamic models. Springer, 1997

1997

[35] [35]

Recursive regularization for inferring gene networks from time-course gene expression profiles.BMC systems biology, 3(1):41, 2009

Teppei Shimamura, Seiya Imoto, Rui Yamaguchi, André Fujita, Masao Nagasaki, and Satoru Miyano. Recursive regularization for inferring gene networks from time-course gene expression profiles.BMC systems biology, 3(1):41, 2009

2009

[36] [36]

Time-dependent covariates in the cox proportional-hazards regression model.Annual review of public health, 20(1):145–157, 1999

Lloyd D Fisher and Danyu Y Lin. Time-dependent covariates in the cox proportional-hazards regression model.Annual review of public health, 20(1):145–157, 1999

1999

[37] [37]

probabilities

Ingrid Daubechies.Ten lectures on wavelets. SIAM, 1992. 12 Table 5: TSA toolbox by category (candidate methods; router picks a sparse subset per query following Appendix E). Category Representative methods Typical intents Statistical Test- ing Paired t-test; Repeated Measures ANOV A; Wilcoxon test; Bayesian change point detection Time point comparison; Va...

1992

[38] [38]

Extract trends for AFP, Hematemesis, Headache: - Severity slope, sudden changes, abnormal duration

[39] [39]

Critical

Output: [Indicator, Time Window, Trend, Inflection Points]. Constraints: - "Critical" AFP = possible liver disease progression. - Single "None" for Headache $\approx$ resolved. TSA narrative output. === Normal Alpha-fetoprotein === Window: 2025-06-18 to 2027-01-09 Trend: Upward (Progressive) | Slope: 0.001892 Inflections: [{’Time’: ’2026-08-12’, ’Prev’: ’...

2025

[40] [40]

TSA trends: AFP up, Hematemesis progress, Headache worse

[41] [41]

Knowledge Base: - AFP Critical + Hematemesis Critical $\rightarrow$Liver Cirrhosis (0.85); - Headache worse + Blurred Vision $\rightarrow$Hypertensive Encephalopathy (0.6). Task:

[42] [42]

Calculate probabilities (use prior diagnoses: Liver Cirrhosis, Gout)

[43] [43]

Rank top 5 diseases + explain logic

[44] [44]

Constraints: - High confidence requires corroborating trend matches

List gaps for consultation. Constraints: - High confidence requires corroborating trend matches. Initial ranked hypotheses (illustrative). === Initial Risk Ranking (Top 5) ===

[45] [45]

Severe Liver Cirrhosis Exacerbation - Logic: AFP up (Critical) + Hematemesis up (Critical) + prior - Prob: 0.85

[46] [46]

Hypertensive Encephalopathy - Logic: Headache worse + patient’s blurred vision - Prob: 0.60

[47] [47]

Acute Gastric Hemorrhage - Logic: Hematemesis (suspected dark red) + unconfirmed alcohol - Prob: 0.45

[48] [48]

Gouty Arthritis Flare-up - Logic: Prior Gout + no recent joint pain (gap) - Prob: 0.20

[49] [49]

Migraine - Logic: Past "Extreme" headache + no aura (gap) - Prob: 0.15 === Gaps to Resolve ===

[50] [50]

AFP spike cause (mid-2026): Abdominal pain? Alcohol?

2026

[51] [51]

Headache: High BP? Painkillers? 19

[52] [52]

Did you have abdominal pain or drink alcohol in mid-2026?

Hematemesis: Frequency? Color? Black stools? C.2.3 Gap-to-question mapping These questions correspond to unresolved slots enumerated by TOPGAPSin Alg. 1. Questions tied to structured gaps. Q1: Q1 (AFP gap): "Did you have abdominal pain or drink alcohol in mid-2026?" Q2: Q2 (Headache gap): "Any high BP, blurred vision, or painkillers lately?" Q3: Q3 (Hemat...

2026

[53] [53]

Severe Liver Cirrhosis Exacerbation - New: + Alcohol (mid-2026) + abdominal pain (Jul 2026) - Prob: 0.85 $\rightarrow$0.92

2026

[54] [54]

Hypertensive Encephalopathy - New: + Confirmed blurred vision + poor sleep (BP risk) - Prob: 0.60 $\rightarrow$0.75

[55] [55]

Acute Gastric Hemorrhage - New: + Dark red hematemesis (monthly) + alcohol (gastric irritant) - Prob: 0.45 $\rightarrow$0.55

[56] [56]

Gouty Arthritis Flare-up - New: - No joint pain reported - Prob: 0.20 $\rightarrow$0.10

[57] [57]

=== Final Risk Prediction (patient_0077) ===

Migraine - New: - No aura + blurred vision = BP link - Prob: 0.15 $\rightarrow$0.05 C.2.4 Narrative wrap-up Clinician-facing summary. === Final Risk Prediction (patient_0077) ===

[58] [58]

High Risk: Severe Liver Cirrhosis Exacerbation - Basis: AFP up (Critical, Aug 2026); hematemesis (dark red, monthly); alcohol trigger + abdominal pain (Jul 2026); prior diagnosis

2026

[59] [59]

Medium Risk: Hypertensive Encephalopathy - Basis: Headache worse; blurred vision; poor sleep (BP risk); no painkillers (rules out drug cause)

[60] [60]

abrupt"/

Low Risk: Acute Gastric Hemorrhage - Basis: Dark red hematemesis (monthly); alcohol irritation; no black stools (no massive hemorrhage). Step-by-step audit trail. === Complete Chain-of-Thought === Step 1: TSA $\rightarrow$Extract 3 trends: 20 Table 7: KB governance checklist (abbreviated). Stage Protocol detail Source ingestion Public medical portals + gu...

2026

[61] [61]

Disease: p=0.__ - one-sentence justification

[62] [62]

Fairness checklist (aligned with §4 and the decoding defaults above)

Disease: p=0.__ - ... Fairness checklist (aligned with §4 and the decoding defaults above)

[63] [63]

Identical templates per backbone; no ad-hoc chain-of-thought hints for competitors

[64] [64]

Contexts truncated/padded to the same token budget before scoring

[65] [65]

Parser extracts the first three probability lines; malformed outputs count as errors equally for every method

[66] [66]

Random seeds, API endpoints, and batching policies are logged alongside the KB hashes in Appendix D. G Appendix G: Full conversational-suite table Figure 3 plots MedQA and HealthBench accuracy (mean ±std over five rerolls) across the five backbones for each agent recipe; Google and DirPred rows are omitted from the bar layout but appear numerically in Tab...