pith. the verified trust layer for science. sign in

arxiv: 2602.01297 · v3 · submitted 2026-02-01 · 💻 cs.AI

RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis

Pith reviewed 2026-05-16 08:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords clinical diagnosislarge language modelsmulti-expert frameworkmedical knowledge graphneurologyclosed-loop reasoningelectronic medical records
0
0 comments X p. Extension

The pith

RE-MCDF introduces a closed-loop multi-expert LLM system guided by a medical knowledge graph to enforce logical consistency in neurological diagnosis from noisy records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single LLMs and loosely structured multi-agent setups often drift into inconsistent diagnoses when facing sparse, heterogeneous neurology EMRs because they overlook relations such as mutual exclusivity or diagnostic confusion between diseases. RE-MCDF counters this with a generation-verification-revision loop that combines a primary expert for candidate diagnoses, a laboratory expert for dynamic indicator prioritization, and a multi-relation awareness expert group that applies explicit logical constraints drawn from a medical knowledge graph. The framework reweights evidence adaptively and corrects implausible hypotheses before final output. Experiments on the neurology subset of CMEMR and a curated dataset show consistent gains over baselines in complex cases. A sympathetic reader would care because reliable grounding of LLM outputs in domain relations could reduce self-reinforcing errors where evidence is incomplete.

Core claim

RE-MCDF establishes a relation-enhanced closed-loop architecture in which a primary expert generates candidate diagnoses and evidence, a laboratory expert dynamically prioritizes heterogeneous clinical indicators, and a multi-relation awareness expert group validates and revises outputs to enforce inter-disease logical constraints supplied by a medical knowledge graph, yielding higher accuracy than prior single-agent or shallow multi-agent baselines on neurology EMR tasks.

What carries the argument

The generation-verification-revision closed-loop that integrates a primary expert, a laboratory expert, and a multi-relation awareness expert group whose decisions are constrained by a medical knowledge graph.

Load-bearing premise

The multi-relation awareness expert group, guided by the medical knowledge graph, can reliably detect and correct logically inconsistent diagnoses without missing clinically important nuances or introducing new errors from the LLMs themselves.

What would settle it

A collection of neurology cases containing subtle but valid diagnostic hypotheses that the system incorrectly rules out because the knowledge-graph constraints treat them as incompatible, leading to measurable drops in recall compared with expert clinician judgment.

Figures

Figures reproduced from arXiv: 2602.01297 by Jie Yang, Lianfen Huang, Seyyedali Hosseinalipour, Shaowei Shen, Xiaohong Yang, Yang Zou, Yongcai Zhang.

Figure 1
Figure 1. Figure 1: Architecture of RE-MCDF. It consists of a primary expert that generates initial diagnosis–evidence pairs from EMRs, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance analysis and human evaluation. Left [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
read the original abstract

Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these, we propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. RE-MCDF introduces a generation--verification--revision closed-loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi-relation awareness and evaluation expert group that explicitly enforces inter-disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios (https://github.com/shenshaowei/RE-MCDF).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RE-MCDF, a closed-loop multi-expert LLM framework for knowledge-grounded clinical diagnosis from heterogeneous EMRs. It combines a primary expert for candidate generation, a laboratory expert for evidence prioritization, and a multi-relation awareness expert group guided by a medical knowledge graph (MKG) to enforce logical constraints such as mutual exclusivity. The architecture uses a generation-verification-revision loop, and the manuscript reports consistent outperformance over state-of-the-art baselines on the neurology subset of CMEMR (NEEMRs) and a curated dataset (XMEMRs).

Significance. If the empirical claims are substantiated with full experimental details and error analysis, the work could advance multi-agent LLM systems for clinical tasks by explicitly incorporating inter-disease logical dependencies via MKG guidance, addressing a gap in existing collaborative reasoning frameworks that often lack rigorous constraint enforcement. The closed-loop design and adaptive reweighting of EMR evidence represent a structured attempt to reduce self-reinforcing errors.

major comments (3)
  1. [Abstract] Abstract: the claim that RE-MCDF 'consistently outperforms state-of-the-art baselines in complex diagnostic scenarios' on NEEMRs and XMEMRs provides no information on the specific baselines used, evaluation metrics, statistical tests, number of runs, or implementation details of the three expert components, leaving the central empirical claim without verifiable support.
  2. [Method] Method section (generation-verification-revision loop and multi-relation awareness expert group): no quantitative breakdown or failure-case analysis is presented for cases where the MKG-guided expert group misses mutual-exclusivity violations, introduces new LLM-induced errors, or fails to correct logically inconsistent diagnoses; without this, gains cannot be confidently attributed to the logical-enforcement mechanism rather than dataset properties.
  3. [Experiments] Experiments section: the manuscript does not describe the construction or coverage of the medical knowledge graph, the curation and train/test splits of XMEMRs, or how the primary/laboratory experts are implemented and prompted, all of which are load-bearing for assessing reproducibility and the source of reported improvements.
minor comments (2)
  1. [Abstract] The GitHub link is given without a specific commit hash or release tag, which would improve reproducibility.
  2. [Method] Notation for the three expert roles and the MKG integration could be formalized with a diagram or pseudocode to clarify the information flow in the closed loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving clarity, reproducibility, and attribution of gains in our work. We agree that additional details are warranted and will revise the manuscript accordingly to strengthen the presentation of empirical claims, method analysis, and experimental descriptions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that RE-MCDF 'consistently outperforms state-of-the-art baselines in complex diagnostic scenarios' on NEEMRs and XMEMRs provides no information on the specific baselines used, evaluation metrics, statistical tests, number of runs, or implementation details of the three expert components, leaving the central empirical claim without verifiable support.

    Authors: We acknowledge that the abstract is high-level and omits these specifics due to length constraints. The full manuscript details the baselines (e.g., GPT-4, Med-PaLM, and multi-agent variants), metrics (accuracy, macro-F1, and clinical relevance scores), statistical tests (paired t-tests with p<0.05), and number of runs (5 independent runs with different seeds). Implementation details for the experts appear in Section 3. In revision, we will expand the abstract to include a concise summary of these elements and add a pointer to the experimental section for full verification. revision: yes

  2. Referee: [Method] Method section (generation-verification-revision loop and multi-relation awareness expert group): no quantitative breakdown or failure-case analysis is presented for cases where the MKG-guided expert group misses mutual-exclusivity violations, introduces new LLM-induced errors, or fails to correct logically inconsistent diagnoses; without this, gains cannot be confidently attributed to the logical-enforcement mechanism rather than dataset properties.

    Authors: We agree that a dedicated failure-case analysis would strengthen attribution of improvements to the MKG-guided logical enforcement. The current manuscript focuses on overall performance gains but lacks quantitative breakdowns of missed violations or introduced errors. In the revision, we will add a new subsection in Experiments with error analysis, including counts of mutual-exclusivity misses, new LLM errors, and correction rates across the NEEMRs and XMEMRs datasets, to better isolate the contribution of the closed-loop mechanism. revision: yes

  3. Referee: [Experiments] Experiments section: the manuscript does not describe the construction or coverage of the medical knowledge graph, the curation and train/test splits of XMEMRs, or how the primary/laboratory experts are implemented and prompted, all of which are load-bearing for assessing reproducibility and the source of reported improvements.

    Authors: We recognize that these details are essential for reproducibility. The manuscript briefly references the MKG (derived from standard medical ontologies like SNOMED-CT and UMLS with neurology-specific relations) and XMEMRs curation (from public EMR sources with expert annotation), but does not provide full construction steps, coverage statistics, or exact splits. Prompt templates and implementation for the primary and laboratory experts are described at a high level in Section 3. In the revised version, we will expand the Experiments section with a dedicated subsection detailing MKG construction and coverage, XMEMRs curation process and 70/30 train/test splits, and full prompting strategies plus hyperparameters for all experts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical multi-expert architecture evaluated on external datasets

full rationale

The paper describes RE-MCDF as a generation-verification-revision closed-loop system with three expert components (primary, laboratory, multi-relation awareness) guided by an external medical knowledge graph. Claims of outperformance rest on experiments against baselines on the neurology subset of CMEMR (NEEMRs) and the curated XMEMRs dataset. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the architecture description. The central performance claims are presented as empirical results rather than reductions to the framework's own inputs by construction, satisfying the criteria for a self-contained non-circular evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim depends on the assumption that LLMs can be reliably specialized into the three expert roles and that the medical knowledge graph accurately encodes the necessary inter-disease constraints.

axioms (2)
  • domain assumption LLMs can be effectively prompted to perform specialized clinical tasks such as diagnosis generation, indicator prioritization, and logical consistency checking
    Invoked in the description of the primary, laboratory, and multi-relation experts.
  • domain assumption The medical knowledge graph captures all relevant logical dependencies among diseases including mutual exclusivity and pathological compatibility
    Used to guide the expert group in validating and correcting candidate diagnoses.
invented entities (1)
  • Multi-relation awareness and evaluation expert group no independent evidence
    purpose: To explicitly enforce inter-disease logical constraints during diagnosis validation
    New architectural component introduced to address limitations of prior multi-agent systems

pith-pipeline@v0.9.0 · 5624 in / 1403 out tokens · 54966 ms · 2026-05-16T08:38:10.802354+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Expert-Level Automated Diagnosis of the Pediatric ECG Using a Deep Neural Network,

    J. Mayourian, W. G. La Cavaet al., “Expert-Level Automated Diagnosis of the Pediatric ECG Using a Deep Neural Network,”JACC Clin. Electrophysiol., vol. 11, no. 6, pp. 1308–1320, 2025

  2. [2]

    Cross-Domain Mutual-Assistance Learning Framework for Fully Automated Diagnosis of Primary Tumor in Na- sopharyngeal Carcinoma,

    X. Dong, K. Yanget al., “Cross-Domain Mutual-Assistance Learning Framework for Fully Automated Diagnosis of Primary Tumor in Na- sopharyngeal Carcinoma,”IEEE Trans. Med. Imag., vol. 43, no. 11, pp. 3676–3689, 2024

  3. [3]

    Towards Factual Consistency in Clinical Summarization: A Self-correction Strategy,

    J. Yanget al., “Towards Factual Consistency in Clinical Summarization: A Self-correction Strategy,”Hum.-Cent. Comput. Info., vol. 15, 2025

  4. [4]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,

    J. Weiet al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” inNeurIPS, vol. 35, 2022, pp. 24 824–24 837

  5. [5]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models,

    S. Yao, D. Yuet al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” inNeurIPS, vol. 36, 2023, pp. 11 809–11 822

  6. [6]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models,

    X. Wang, J. Weiet al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” inICLR, 2023

  7. [7]

    Verification is All You Need: Prompting Large Language Models for Zero-Shot Clinical Coding,

    S. Li, C. Zhenget al., “Verification is All You Need: Prompting Large Language Models for Zero-Shot Clinical Coding,”IEEE J. Biomed. Health Inform., vol. 29, no. 11, pp. 8536–8549, 2025

  8. [8]

    Mind AI’s Mind: A Clinically Aligned Explain- able AI Pipeline for Depression Diagnosis via Large Language Models,

    Y . Wu, G. Wanet al., “Mind AI’s Mind: A Clinically Aligned Explain- able AI Pipeline for Depression Diagnosis via Large Language Models,” IEEE Trans. Affect. Comput., vol. 17, no. 1, pp. 739–756, 2026

  9. [9]

    Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study,

    Y . Gao, R. Liet al., “Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study,”JMIR AI, vol. 4, 2025

  10. [10]

    CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs

    Y . Liu, S. Liet al., “CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs,” arXiv preprint arXiv:2601.11047, 2026

  11. [11]

    MedIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs,

    M. Jia, J. Duanet al., “MedIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs,” in COLING, 2025, pp. 9278–9298

  12. [12]

    Narrative Structure Extraction in Disinformation and Trustworthy News: A Comparison of LLM, KG, and KG-Augmented Pipelines,

    J. Mandravickait ˙e, “Narrative Structure Extraction in Disinformation and Trustworthy News: A Comparison of LLM, KG, and KG-Augmented Pipelines,” inWorkshops of KONVENS, 2025, pp. 86–103

  13. [13]

    MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models,

    Y . Wen, Z. Wanget al., “MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models,” inACL, 2024, pp. 10 370–10 388

  14. [14]

    KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph,

    J. Jiang, K. Zhouet al., “KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph,” inACL, 2025, pp. 9505–9523

  15. [15]

    AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement,

    S. Singhet al., “AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement,” inICRA, 2025, pp. 4345–4351

  16. [16]

    Accurate Insights, Trustworthy Interactions: De- signing a Collaborative AI-Human Multi-Agent System with Knowledge Graph for Diagnosis Prediction,

    H. Li, X. Chenget al., “Accurate Insights, Trustworthy Interactions: De- signing a Collaborative AI-Human Multi-Agent System with Knowledge Graph for Diagnosis Prediction,” inCHI, 2025

  17. [17]

    Knowledge-Aware Co-Reasoning for Multi- disciplinary Collaboration,

    X. Li, Wanghaijiaoet al., “Knowledge-Aware Co-Reasoning for Multi- disciplinary Collaboration,” inEMNLP, 2025, pp. 13 604–13 620

  18. [18]

    Leveraging Multi-Agent Systems and Large Language Models for Diabetes Knowledge Graphs,

    D. H. Ho, U. Daset al., “Leveraging Multi-Agent Systems and Large Language Models for Diabetes Knowledge Graphs,” inBigData, 2024, pp. 3401–3410

  19. [19]

    High-Quality Disease ification in Line with International Standards: Current Status and Reflections,

    J. Zhou and A. Liu, “High-Quality Disease ification in Line with International Standards: Current Status and Reflections,”Med. J. Peking Union Med. Coll. Hosp., vol. 15, no. 5, pp. 993–998, 2024

  20. [20]

    Predicting the Onset of Chronic Kid- ney Disease (CKD) for Diabetic Patients with Aggregated Longitudinal EMR Data,

    N. Aminnejad, M. Greiveret al., “Predicting the Onset of Chronic Kid- ney Disease (CKD) for Diabetic Patients with Aggregated Longitudinal EMR Data,”PLOS Digital Health, vol. 4, no. 1, p. e0000700, 2025

  21. [21]

    Encounter Decisions for Patients With Diverse Sociodemographic Characteristics: Predictive Analytics of EMR Data From a Large Chain of Clinic,

    U. K. Mukherjeeet al., “Encounter Decisions for Patients With Diverse Sociodemographic Characteristics: Predictive Analytics of EMR Data From a Large Chain of Clinic,”JOM, vol. 71, no. 4, pp. 447–482, 2025

  22. [22]

    Multimodal Representation Learning Based on Personalized Graph-Based Fusion for Mortality Prediction Using Electronic Medical Records,

    A. Al-Dailami, H. Kuanget al., “Multimodal Representation Learning Based on Personalized Graph-Based Fusion for Mortality Prediction Using Electronic Medical Records,”Big Data Min. Anal., vol. 8, no. 4, pp. 933–950, 2025

  23. [23]

    FedComDist: Towards Effective Personalized Federated Learning for Patient Outcome Prediction Using Multi-Center Electronic Medical Records,

    A. Al-Dailamiet al., “FedComDist: Towards Effective Personalized Federated Learning for Patient Outcome Prediction Using Multi-Center Electronic Medical Records,”IEEE J. Biomed. Health, vol. 29, no. 8, pp. 6004–6016, 2025

  24. [24]

    Paging Dr. GPT: Extracting Informa- tion from Clinical Notes to Enhance Patient Predictions,

    D. Anderson, M. Andersonet al., “Paging Dr. GPT: Extracting Informa- tion from Clinical Notes to Enhance Patient Predictions,”arXiv preprint arXiv:2504.12338, 2025

  25. [25]

    From EMR Data to Clinical Insight: An LLM-Driven Framework for Automated Pre-Consultation Questionnaire Generation,

    R. Ding, Q. Sunet al., “From EMR Data to Clinical Insight: An LLM-Driven Framework for Automated Pre-Consultation Questionnaire Generation,”arXiv preprint arXiv:2508.00581, 2025

  26. [26]

    Classifying Unstructured Text in Electronic Health Records for Mental Health Prediction Models: Large Language Model Evaluation Study,

    N. C. Cardamone, M. Olfsonet al., “Classifying Unstructured Text in Electronic Health Records for Mental Health Prediction Models: Large Language Model Evaluation Study,”JMIR Med. Inf., vol. 13, no. 1, p. e65454, 2025

  27. [27]

    Unlocking Potential of Generative Large Language Models for Adverse Drug Reaction Relation Prediction in Iischarge Summaries: Analysis and Strategy,

    Y . L. Koon, H. X. Tanet al., “Unlocking Potential of Generative Large Language Models for Adverse Drug Reaction Relation Prediction in Iischarge Summaries: Analysis and Strategy,”Clin. Pharmacol. Ther., vol. 118, no. 6, pp. 1554–1561, 2025

  28. [28]

    LLM-DG: Leveraging Large Language Model for Enhanced Disease Prediction via Inter-Patient and Intra- Patient Modeling,

    Y . Kang, M. Yanget al., “LLM-DG: Leveraging Large Language Model for Enhanced Disease Prediction via Inter-Patient and Intra- Patient Modeling,”Inform. Fusion, vol. 121, p. 103145, 2025

  29. [29]

    Can Large Language Models Enhance Predic- tions of Disease Progression? Investigating Through Disease Network Link Prediction,

    H. Lu and U. Naseem, “Can Large Language Models Enhance Predic- tions of Disease Progression? Investigating Through Disease Network Link Prediction,” inEMNLP, 2024

  30. [30]

    Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds,

    J. Wu, X. Wuet al., “Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds,” inIJCAI, 2024, pp. 7491–7499

  31. [31]

    HyKGE: A Hypothesis Knowledge Graph Enhanced RAG Framework for Accurate and Reliable Medical LLMs Responses,

    X. Jiang, R. Zhanget al., “HyKGE: A Hypothesis Knowledge Graph Enhanced RAG Framework for Accurate and Reliable Medical LLMs Responses,” inACL, 2025, pp. 11 836–11 856

  32. [32]

    KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques,

    R. Yang, H. Liuet al., “KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques,” in Workshop on BNLP, 2024, pp. 155–166

  33. [33]

    Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph,

    J. Sun, C. Xuet al., “Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph,” inICRL, vol. 2024, 2024, pp. 3868–3898

  34. [34]

    Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs,

    B. Jin, C. Xieet al., “Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs,” inFindings of ACL, 2024, pp. 163–184

  35. [35]

    MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring,

    Q. Wang, R. Shenget al., “MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring,” arXiv preprint arXiv:2512.24181, 2025

  36. [36]

    MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models,

    J. Sun, C. Xuet al., “MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models,” in AAAI, vol. 40, no. 2, 2026, pp. 845–853

  37. [37]

    MAGIC: AN LLM-based Multi-Agent Activated Graph-reasoning Intelligent Collaboration model for Liver Disease Di- agnosis,

    B. Liu, Y . Nieet al., “MAGIC: AN LLM-based Multi-Agent Activated Graph-reasoning Intelligent Collaboration model for Liver Disease Di- agnosis,”Infor. Fusion, vol. 126, p. 103557, 2026

  38. [38]

    KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-Shot Diagnosis Prediction Using Multi- agent LLMs,

    Y . Xie, H. Cuiet al., “KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-Shot Diagnosis Prediction Using Multi- agent LLMs,” inAMIA, 2025, pp. 1–1

  39. [39]

    Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation,

    U. Das, K. B. Atmakuriet al., “Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation,” arXiv preprint arXiv:2601.01844, 2026

  40. [40]

    Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning,

    X. Wang, Y . Jianget al., “Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning,” inEMNLP, 2021, pp. 1800–1812

  41. [41]

    Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs,

    S. Yu, R. Baoet al., “Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs,” inNAACL, 2025, pp. 8985–8997

  42. [42]

    A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems,

    Z. Yi, J. Ouyanget al., “A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems,”ACM Comput. Surv., vol. 58, no. 6, 2025

  43. [43]

    Contrastive Learning with large language models for medical code prediction,

    Y . Wuet al., “Contrastive Learning with large language models for medical code prediction,”Expert Syst. Appl., vol. 277, p. 127241, 2025

  44. [44]

    MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration,

    D. Wan, J. Chenet al., “MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration,” inNAACL, 2025, pp. 9882–9901

  45. [45]

    Muma-Tom: Multi-Modal Multi-Agent Theory of Mind,

    H. Shi, S. Yeet al., “Muma-Tom: Multi-Modal Multi-Agent Theory of Mind,” inAAAI, vol. 39, no. 2, 2025, pp. 1510–1519

  46. [46]

    SciAgents: Automating Scientific Discovery through Bioinspired Multi-Agent Intelligent Graph Reason- ing,

    A. Ghafarollahi and M. J. Buehler, “SciAgents: Automating Scientific Discovery through Bioinspired Multi-Agent Intelligent Graph Reason- ing,”Advanced Materials, vol. 37, no. 22, p. 2413523, 2025

  47. [47]

    Ai hospital: Interactive evaluation and collaboration of llms as intern doctors for clinical diagnosis

    Z. Fan, J. Tanget al., “AI Hospital: Interactive Evaluation and Collabo- ration of LLMs as Intern Doctors for Clinical Diagnosis,”arXiv preprint arXiv:2402.09742, 2024