Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents
Pith reviewed 2026-05-10 17:26 UTC · model grok-4.3
The pith
Aegle virtualizes multi-disciplinary team reasoning with graph-based agents and SOAP separation to improve outpatient clinical intake.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aegle is a synchronous virtual MDT framework that brings MDT-level reasoning to outpatient consultations via a graph-based multi-agent architecture. It formalizes the consultation state using a structured SOAP representation, separating evidence collection from diagnostic reasoning to improve traceability and bias control. An orchestrator dynamically activates specialist agents, which perform decoupled parallel reasoning and are subsequently integrated by an aggregator into a coherent clinical note.
What carries the argument
Graph-based multi-agent architecture with an orchestrator for dynamic specialist activation, decoupled parallel reasoning by agents, and an aggregator for integration, all operating on separated SOAP evidence and diagnosis states.
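The control flow described above can be pictured with a small sketch. This is not the authors' implementation: the class `SoapState`, the two toy specialists, and the keyword rule inside `orchestrate` are all hypothetical stand-ins for what the paper describes as LLM-driven components. The sketch only illustrates the shape of the loop: an orchestrator picks specialists, they reason in parallel over the same frozen evidence, and an aggregator writes to the reasoning side without touching the evidence.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SoapState:
    """Hypothetical consultation state: evidence kept apart from reasoning."""
    evidence: dict = field(default_factory=dict)   # collected facts (S/O side)
    reasoning: dict = field(default_factory=dict)  # conclusions (A/P side)


# Toy specialists (assumptions): each maps read-only evidence to a short opinion.
def cardiology(evidence):
    return "review cardiac causes" if "chest pain" in evidence.get("complaint", "") else "no concern"


def pulmonology(evidence):
    return "review pulmonary causes" if "dyspnea" in evidence.get("complaint", "") else "no concern"


SPECIALISTS = {"cardiology": cardiology, "pulmonology": pulmonology}


def orchestrate(state):
    """Choose specialists to activate; a keyword rule stands in for a model call."""
    complaint = state.evidence.get("complaint", "")
    active = []
    if "chest pain" in complaint:
        active.append("cardiology")
    if "dyspnea" in complaint:
        active.append("pulmonology")
    return active


def run_round(state):
    active = orchestrate(state)
    snapshot = dict(state.evidence)  # every specialist sees the same frozen copy
    with ThreadPoolExecutor(max_workers=max(1, len(active))) as pool:
        futures = {name: pool.submit(SPECIALISTS[name], snapshot) for name in active}
        opinions = {name: fut.result() for name, fut in futures.items()}
    # Aggregator step: merge opinions into the reasoning side only.
    state.reasoning["assessment"] = opinions
    return state


state = run_round(SoapState(evidence={"complaint": "chest pain and dyspnea for 2 hours"}))
print(state.reasoning["assessment"])
```

Because the specialists receive a snapshot and never see each other's output, one agent's opinion cannot anchor another's within a round, which is the intuition behind "decoupled parallel reasoning" as summarized here.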
Load-bearing premise
The multi-agent graph architecture with parallel reasoning and SOAP separation actually delivers MDT-level bias control and traceability that improves real-world clinical decisions beyond benchmark scores.
What would settle it
A randomized trial showing no reduction in diagnostic errors or incomplete documentation when Aegle assists intake compared to standard single-physician practice.
Original abstract
The initial outpatient consultation is critical for clinical decision-making, yet it is often conducted by a single physician under time pressure, making it prone to cognitive biases and incomplete evidence capture. Although the Multi-Disciplinary Team (MDT) reduces these risks, they are costly and difficult to scale to real-time intake. We propose Aegle, a synchronous virtual MDT framework that brings MDT-level reasoning to outpatient consultations via a graph-based multi-agent architecture. Aegle formalizes the consultation state using a structured SOAP representation, separating evidence collection from diagnostic reasoning to improve traceability and bias control. An orchestrator dynamically activates specialist agents, which perform decoupled parallel reasoning and are subsequently integrated by an aggregator into a coherent clinical note. Experiments on ClinicalBench and a real-world RAPID-IPN dataset across 24 departments and 53 metrics show that Aegle consistently outperforms state-of-the-art proprietary and open-source models in documentation quality and consultation capability, while also improving final diagnosis accuracy. Our code is available at https://github.com/HovChen/Aegle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Aegle, a graph-based multi-agent framework that virtualizes a Multi-Disciplinary Team (MDT) for initial outpatient clinical intake. It uses a synchronous orchestrator to activate specialist agents performing decoupled parallel reasoning on a structured SOAP consultation state, followed by aggregator integration into a coherent clinical note. The central claim is that this architecture delivers MDT-level bias control and traceability, leading to consistent outperformance over proprietary and open-source models in documentation quality, consultation capability, and final diagnosis accuracy, as shown on ClinicalBench and the real-world RAPID-IPN dataset spanning 24 departments and 53 metrics. Code is released.
Significance. If the architecture's mechanisms for parallel reasoning and SOAP separation can be shown to causally improve bias mitigation and traceability beyond what simple ensembling or extended context provide, the work would offer a scalable alternative to resource-intensive human MDTs in routine care. The multi-dataset evaluation and open code strengthen potential impact in clinical AI and multi-agent systems.
major comments (2)
- [Experiments] Experiments section (and abstract): the claim of consistent outperformance across 53 metrics on ClinicalBench and RAPID-IPN is presented without reported statistical tests, error bars, baseline implementation details, or data exclusion rules. This leaves open whether gains derive from the graph orchestrator, decoupled agents, and SOAP separation or from unaccounted factors such as ensembling or context length.
- [Evaluation] Evaluation and method sections: no bias-injection experiments, clinician-rated bias scores, or quantitative traceability audits (e.g., source attribution completeness or reasoning path coverage) are described to isolate the contribution of parallel specialist reasoning and aggregator integration to reduced cognitive bias. The central MDT-level bias-control claim therefore rests on indirect benchmark gains rather than targeted validation.
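To make the second major comment concrete, here is a minimal sketch of what a quantitative traceability audit could compute. The two metric definitions are assumptions chosen for illustration (the report names "source attribution completeness" and "reasoning path coverage" but does not define them), and the note structure with a `sources` list is hypothetical.

```python
def attribution_completeness(note_items):
    """Share of note items that cite at least one source (dialogue turn or agent)."""
    if not note_items:
        return 0.0
    attributed = sum(1 for item in note_items if item.get("sources"))
    return attributed / len(note_items)


def path_coverage(logged_steps, expected_steps):
    """Share of expected reasoning steps that appear in the logged agent path."""
    expected = set(expected_steps)
    if not expected:
        return 1.0
    return len(expected & set(logged_steps)) / len(expected)


# Hypothetical note: each item records where its content came from.
note = [
    {"field": "history_of_present_illness", "sources": ["turn_3"]},
    {"field": "physical_examination", "sources": ["turn_5", "cardiology"]},
    {"field": "assessment", "sources": []},
    {"field": "plan", "sources": ["aggregator"]},
]
print(attribution_completeness(note))
print(path_coverage(
    ["orchestrator", "cardiology", "aggregator"],
    ["orchestrator", "cardiology", "urology", "aggregator"],
))
```

Metrics of this kind would let a reader compare Aegle against a single-model baseline on traceability directly, rather than inferring it from documentation-quality scores.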
minor comments (2)
- [Introduction] The abstract and introduction use 'synchronous' and 'decoupled parallel' without clarifying how true simultaneity is achieved in the graph execution model.
- [Figures/Tables] Figure captions and table legends should explicitly state the number of runs or seeds used for the reported metrics.
Simulated Author's Rebuttal
Thank you for the detailed review and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of Aegle's contributions. Below, we provide point-by-point responses to the major comments.
- Referee: [Experiments] Experiments section (and abstract): the claim of consistent outperformance across 53 metrics on ClinicalBench and RAPID-IPN is presented without reported statistical tests, error bars, baseline implementation details, or data exclusion rules. This leaves open whether gains derive from the graph orchestrator, decoupled agents, and SOAP separation or from unaccounted factors such as ensembling or context length.
  Authors: We fully agree that reporting statistical significance, error bars, and detailed baseline information is essential for robust claims. In the revised manuscript, we will add statistical tests (such as paired t-tests or non-parametric equivalents with p-values adjusted for multiple comparisons) across the key metrics, include error bars in figures and tables, and expand the experimental setup section with precise descriptions of baseline model implementations, prompting strategies, and any data exclusion criteria used in the RAPID-IPN dataset. Furthermore, to directly address whether the gains stem from the proposed architecture rather than ensembling or context length, we will include additional ablation experiments comparing Aegle against (1) an ensemble of the same specialist models without the orchestrator and aggregator, and (2) single-model baselines with context lengths matched to Aegle's aggregated state. These revisions will help isolate the causal contributions of the graph-based parallel reasoning and SOAP separation.
  Revision: yes
- Referee: [Evaluation] Evaluation and method sections: no bias-injection experiments, clinician-rated bias scores, or quantitative traceability audits (e.g., source attribution completeness or reasoning path coverage) are described to isolate the contribution of parallel specialist reasoning and aggregator integration to reduced cognitive bias. The central MDT-level bias-control claim therefore rests on indirect benchmark gains rather than targeted validation.
  Authors: We acknowledge that the current evaluation does not include direct bias-injection experiments or clinician-rated bias assessments, which would provide stronger causal evidence for the bias mitigation properties. The manuscript's claims are supported by consistent outperformance on real-world clinical data across multiple departments and metrics, which we interpret as evidence of improved bias control through the decoupled reasoning process. In the revision, we will add a new subsection in the discussion that explicitly discusses the mechanisms by which the orchestrator, specialist agents, and aggregator promote traceability (e.g., by maintaining separate SOAP components and logging agent contributions), and we will include quantitative traceability metrics such as reasoning path coverage where feasible from existing logs. We will also note the absence of bias-injection studies as a limitation and propose it as future work. This addresses the concern without requiring entirely new data collection at this stage.
  Revision: partial
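The first response promises paired tests with p-values adjusted for multiple comparisons across the metrics. As a rough sketch of what that could look like, the code below pairs an exact two-sided sign test (one simple non-parametric choice, swapped in here for illustration; the rebuttal does not commit to a specific test) with the Holm step-down adjustment. The metric names and per-case score differences are made up for the example.

```python
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Made-up per-case differences (system score minus baseline score) for two metrics.
diffs_by_metric = {
    "metric_a": [0.4, 0.2, 0.5, 0.1, 0.3, 0.2, 0.6, 0.1],
    "metric_b": [0.1, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1],
}
raw = [sign_test_p(d) for d in diffs_by_metric.values()]
print(dict(zip(diffs_by_metric, holm_adjust(raw))))
```

With 53 metrics, an adjustment of this kind matters: without it, a handful of nominally significant differences would be expected by chance alone.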
Circularity Check
No significant circularity: claims rest on external benchmarks and datasets
full rationale
The paper presents an empirical multi-agent system evaluated on independent external resources (ClinicalBench benchmark and real-world RAPID-IPN dataset spanning 24 departments and 53 metrics). No mathematical derivations, parameter fittings, or predictions are described that reduce by construction to the model's own inputs or self-citations. Performance comparisons are made against external proprietary and open-source baselines on held-out data, with no self-definitional loops, fitted-input predictions, or load-bearing self-citation chains. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-disciplinary team reasoning reduces cognitive biases and improves evidence capture compared to single-physician intake.
invented entities (1)
- Aegle framework: no independent evidence
Reference graph
Works this paper leans on
-
[1]
ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.
Qiyuan Chen, Jiahe Chen, Hongsen Huang, Qian Shao, Jintai Chen, Renjie Hua, Hongxia Xu, Ruijia Wu, Ren Chuan, and Jian Wu. 2025a. Cc-gseo-bench: A content-centric benchmark for measuring source influence in gene...
-
[2]
Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc.
Jacqueline G. You, Reema H. Dbouk, Adam Landman, David Y. Ting, Sayon Dutta, Julie C. Wang, Amanda J. Centi, Molly Macfarlane, Eran Bechor, Jonathan Letourneau, Gabrielle Cho...
-
[3]
**Based on the information**: All of your responses must be strictly based on the case text above
-
[4]
**No fabrication**: Do not imagine, extend, or invent any medical history or details that are not provided in the case. If the doctor asks about information that is completely absent from the materials, respond naturally with phrases such as “I’m not sure,” “I didn’t notice,” or “I don’t have that issue.”
-
[5]
**Non-professional stance**: You are an ordinary patient and should not provide medical explanations, professional interpretations, or diagnostic suggestions
-
[6]
**Natural expression**: Use conversational, natural, emotionally realistic first-person language (“I…”). Do not list information like a robot. **Expression Requirements** - Respond as a real patient speaking to a doctor, avoid mechanical answers, lists, or summaries. - When describing experiences, symptoms, and feelings, focus on subjective experience (e....
-
[7]
**Efficient decision-making**: For simple confirmation questions, routine information gathering, or single-dimension follow-up inquiries, you should generate the question directly without activating specialist physicians
-
[8]
For complex situations, dynamically dispatch specialist Agents (Specialists) and issue them instructions that **align with the current phase goal**.
**Decision logic**:
- **Simple cases**: When only specific numerical values need to be supplemented, past medical history needs to be confirmed, or simple details need to be clarified → **keep `active_speci...
-
[9]
Analyze the input information and determine which SOAP field each piece of information should be written into
-
[10]
Strictly output in the JSON format defined by `SpecialistOutput`, and in `draft_modifications` clearly specify the exact SOAP field to which each item of information belongs, for example:
- “Write ‘chest pain for 2 hours’ into `history_of_present_illness`”
- “Write ‘blood pressure 150/90 mmHg’ into `physical_examination`”
-
[11]
{next_question_instruction} historytaking diagnostic_synthesis phase_instructions: From the professional perspective of {spec_id}, carefully review the current case features. Focus on clinical information that is directly relevant to your specialty, including basic patient information, history of present illness, past medical history, physical examinati...
-
[12]
Further evaluation: - Planned tests, purpose, and timing, with guideline-based rationale - If surgery is indicated, include indication and approach
-
[13]
Medications: - Drug classes with indications, contraindications, and monitoring points - Avoid specific dosages
-
[14]
Non-drug management: - Hydration goals, diet, lifestyle advice, pain management, and warning signs
-
[15]
Follow-up: - Follow-up timeframe, repeat tests, and visit format (in-person or remote) soap_guidance: Aggregator You are a recorder and decision-maker in a medical MDT. **Current phase: {phase_name}** As a professional physician, you must organize and document patient information strictly according to the standard SOAP format for the initial medical recor...
-
[16]
**Information integration**: Carefully review the input from specialist physicians (if any). **If `specialist_outputs` is empty, extract information directly from the patient’s latest response and the coordinator’s suggestions, and update the SOAP record accordingly.**
-
[17]
**Empathy**: You are the doctor directly communicating with the patient; your tone should be warm, professional, and patient
-
[18]
**Logical consistency**: Maintain a complete and coherent medical record. If the patient has clearly stated that certain information cannot be provided, document this truthfully and do not ask again. **Strictly output in the specified JSON format** historytaking diagnostic_synthesis soap_guidance: (the same as specialist) aggregator_task_description: Summ...
-
[19]
Present Illness & Comprehensive History 1 1.1 Detailed HPI Complaint: location, quality, severity, duration, onset, radiation, aggravating/relieving factors. 4 Incorrect/illogical; key elements missing Some elements present Most elements present All elements present Concise, organized, diagnostically salient 1 1.2 Prior Dx/Tx Course Prior evaluations/trea...
-
[20]
Physical Examination 2 2.1 Complete Physical Exam Comprehensive documentation of physical examination. 4 Errors/not addressed Major components missing Mostly complete; minor omissions Complete exam Well-organized; professional terms 2 2.2 Key Physical Findings Highlights diagnostically relevant positive/negative findings. 3 Missing/incorrect emphasis Part...
-
[21]
Diagnosis & Differential 3 3.1 Diagnostic Completeness Primary, secondary, and additional diagnoses. 4 Primary missing/incorrect Primary only Primary + some secondary Primary + all secondary Includes additional diagnoses 3 3.2 Objective Evidence Evidence from history, exam, and investigations. 4 Missing/incorrect evidence One domain Two domains All releva...
-
[22]
Plan 4 4.1 Plan Completeness Investigations, treatment, lifestyle, follow-up. 4 Missing/incorrect Single aspect Multi-dimensional Dynamic assessment/prognosis Concise, rigorous 4 4.2 Plan Appropriateness Evidence/reasoning supports key decisions. 3 Inappropriate/incorrect Vague/unsupported Generally appropriate Clear; strong evidence –
-
[23]
Overall Competency 5 5.1 Presentation Skill Quality of written presentation. 3 – Basic: partial Good: most Excellent: nearly all – 5 5.2 Reasoning Skill Quality of diagnostic reasoning. 3 – Basic reasoning Relevant comparison Comprehensive, rigorous – 5 5.3 Decision Skill Quality of decisions in the plan. 3 – List actions only Partial reasoning Evidence-b...
-
[24]
Structural completeness Severe omission of core modules (e.g., no HPI, PMH, or physical exam); structure is chaotic, and the basic framework is unrecognizable. Incomplete core modules (e.g., missing treatment or family history); module order reversed, impairing information retrieval. Major core modules present (HPI, PMH, physical exam), but minor modules ...
-
[25]
Logical coherence No clear timeline or causal relationships; symptom sequence is contradictory, and disease course cannot be reconstructed. Timeline is vague; symptom evolution contains clear contradictions. Timeline mostly complete, but relationships between some symptoms are unclear, with occasional logical gaps. Clear timeline with explicit causal link...
-
[26]
Terminology accuracy Frequent misuse of medical terms or self-created abbreviations renders core information uninterpretable. Multiple terminology errors or non-standard abbreviations without clarification, requiring repeated inference. Occasional imprecise terms or abbreviations that generally follow conventions but need clarification. Accurate and stand...
-
[27]
Information redundancy Large amounts of irrelevant information obscure core content; excessive verbosity overwhelms key findings. Substantial redundancy or irrelevant content; non-essential information exceeds 20% of the note. Occasional redundancy or repetition; irrelevant information below 10% and does not impair extraction. Concise information with no ...
-
[28]
Prostate Cancer
Information sufficiency Key information is buried among secondary content and not emphasized, making it easy to miss. Some key findings are insufficiently highlighted and require careful searching to identify. Most key information is reasonably placed but not emphasized through formatting or structure. Key information (e.g., diagnostic evidence or critical...
-
[29]
**Patient profile:** Male, 73 years old, admitted due to “difficulty in urination for half a year and a confirmed diagnosis of prostate cancer for 2 months.”
-
[30]
prostate cancer
**History of present illness:** The patient began experiencing progressive difficulty in urination half a year ago, accompanied by urinary frequency and nocturia 3–4 times per night. Two months ago, he was hospitalized at XX Municipal Central Hospital for **acute urinary retention**. Laboratory testing revealed a PSA level greater than 155 µg/L. A prostat...
-
[31]
**Past medical history:** 3.1 **Chronic diseases:** History of gout for over 30 years and hypertension for more than 20 years, currently well controlled with oral medications. 3.2 **Surgical, trauma, and transfusion history:** History of open appendectomy more than 30 years ago. Denies other surgeries or blood transfusions. 3.3 **Marital, reproductive, an...
-
[32]
**Physical examination:** Conscious and alert, in fair general condition. Pain score: 0. Respiratory rate: 18 breaths/min; Oral temperature: 36.7°C; Pulse: 69 beats/min; Blood pressure: 131/71 mmHg. No cyanosis of the lips. No palpable superficial lymphadenopathy. Cardiac and pulmonary auscultation revealed no significant abnormalities. Abdomen soft, with...
-
[33]
**Auxiliary examinations:**
- **Haining Central Hospital (2017-11-11):** Urinary system ultrasound showed mild bilateral hydronephrosis, prostatic hyperplasia, and post-void residual urine volume of approximately 300 mL.
- **Tumor markers (2017-11-12):** Total PSA >155.00 µg/L; free PSA 8.87 µg/L.
- **Prostate biopsy pathology (2017-11-24):**
-
[34]
*Left inner prostate:* 3 cores obtained; prostate cancer identified in 2 cores; Gleason score 4+3=7; tumor involvement 30%
-
[35]
*Left outer prostate:* 3 cores obtained; prostate cancer in all 3 cores; Gleason score 4+4=8; tumor involvement 50%
-
[36]
*Right inner prostate:* 3 cores obtained; prostate cancer in all 3 cores; Gleason score 4+3=7; tumor involvement 20%
-
[37]
*Right outer prostate:* 3 cores obtained; prostate cancer in all 3 cores; Gleason score 5+3=8; tumor involvement 60%
-
[38]
*Suspicious area of left outer prostate:* 3 cores obtained; prostate cancer in 2 cores; Gleason score 5+3=8; tumor involvement 70%.
------
## Preliminary Diagnosis
-
[39]
Status post appendectomy
------
## Diagnostic Discussion
### Diagnostic Basis
The patient is a 72-year-old male admitted due to “difficulty in urination for half a year and a confirmed diagnosis of prostate cancer for 2 months.” He developed progressive urinary obstruction with urinary frequency and nocturia half a year ago, consistent with symptoms assoc...
-
[40]
**Benign prostatic hyperplasia (BPH):** The patient presents with urinary obstruction, frequency, and nocturia, and digital rectal examination shows prostate enlargement, which may resemble BPH. However, the markedly elevated PSA and biopsy-confirmed malignancy exclude isolated BPH
-
[41]
**Urinary tract infection:** May cause urinary symptoms and hematuria; however, the patient has no fever or dysuria, and the significantly elevated PSA favors malignancy
-
[42]
**Bladder tumor:** Can present with hematuria and urinary obstruction, but prostate biopsy has already confirmed prostate cancer, and there is no current evidence of bladder mass lesions
-
[43]
**Neurogenic bladder:** Can result in urinary retention and increased residual urine volume, but the patient has no history of neurologic disease and normal anal sphincter tone
-
[44]
**Urethral stricture:** May cause urinary obstruction, but in an elderly male with markedly elevated PSA, prostate cancer is more consistent with the clinical picture.
------
## Treatment Plan
### Further diagnostic evaluation
According to EAU guidelines, the patient has high-risk prostate cancer (Gleason score 7–8, PSA >155 µg/L). Comprehensive staging e...
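Returning to the specialist prompt excerpt in entry [10]: it asks each specialist to emit JSON following `SpecialistOutput`, with `draft_modifications` naming the exact SOAP field for each item. The excerpt does not show the schema, so the sketch below assumes each modification is an object with hypothetical `field` and `text` keys, and it whitelists only the two field names that appear in the excerpt. It illustrates how an aggregator might apply such output while rejecting writes to unknown fields.

```python
import json
from dataclasses import dataclass, field

# Only the two SOAP fields named in the prompt excerpt; the full set is not shown.
ALLOWED_FIELDS = {"history_of_present_illness", "physical_examination"}


@dataclass
class SpecialistOutput:
    # Assumed shape: a list of {"field": ..., "text": ...} objects.
    draft_modifications: list = field(default_factory=list)


def parse_specialist_output(raw_json):
    """Parse a specialist's JSON reply into the assumed SpecialistOutput shape."""
    data = json.loads(raw_json)
    return SpecialistOutput(draft_modifications=list(data.get("draft_modifications", [])))


def apply_modifications(soap_record, output):
    """Append each valid item to its SOAP field; return the rejected items."""
    rejected = []
    for mod in output.draft_modifications:
        target = mod.get("field")
        text = mod.get("text", "")
        if target not in ALLOWED_FIELDS or not text:
            rejected.append(mod)
            continue
        soap_record.setdefault(target, []).append(text)
    return rejected


raw = json.dumps({
    "draft_modifications": [
        {"field": "history_of_present_illness", "text": "chest pain for 2 hours"},
        {"field": "physical_examination", "text": "blood pressure 150/90 mmHg"},
        {"field": "unknown_field", "text": "should be rejected"},
    ]
})
record = {}
rejected = apply_modifications(record, parse_specialist_output(raw))
print(record)
print(rejected)
```

Keeping the rejected items instead of silently dropping them gives the aggregator a log entry to audit, which fits the traceability goal the paper emphasizes.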