MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis

Ashmal Vayani; Joseph Fioresi; Mubarak Shah; Parth Parag Kulkarni; Song Wang

arxiv: 2604.06180 · v1 · submitted 2026-02-05 · 📡 eess.IV · cs.CV· cs.LG· cs.MA

MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis

Ashmal Vayani , Parth Parag Kulkarni , Joseph Fioresi , Song Wang , Mubarak Shah This is my paper

Pith reviewed 2026-05-16 06:41 UTC · model grok-4.3

classification 📡 eess.IV cs.CVcs.LGcs.MA

keywords multi-agent systemsreinforcement learningmedical diagnosislarge multimodal modelsdynamic routingspecialist agentsLMM agents

0 comments

The pith

MedRoute trains a reinforcement learning router to dynamically assign medical queries to specialist LMM agents, raising diagnostic accuracy above static multi-agent baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MedRoute as a multi-agent system that emulates clinical practice by combining a General Practitioner, multiple specialist LMM agents, and a Moderator. A key addition is an RL-trained router inside the General Practitioner that learns to pick the most suitable specialist for each text or image query instead of using fixed assignments. This dynamic selection is evaluated on medical datasets and shown to improve accuracy over prior approaches. The work addresses the limitation that single general LMMs perform poorly across the broad range of real medical conditions. If correct, the result suggests that routing mechanisms can make multi-agent AI more practical for diagnosis tasks.

Core claim

MedRoute deploys a collaborative group of specialist LMM agents together with a General Practitioner that contains an RL-trained router for dynamic specialist selection and a Moderator that produces the final diagnosis. This structure allows the system to adapt specialist choice to the specific input rather than relying on static or predefined routing, and evaluations on text and image medical datasets confirm higher diagnostic accuracy than state-of-the-art baselines.

What carries the argument

The RL-trained router inside the General Practitioner agent, which selects the appropriate specialist LMM for each query based on learned rewards from diagnostic outcomes.

If this is right

Dynamic specialist selection produces higher diagnostic accuracy than static or predefined multi-agent setups on both text and image medical data.
The framework replicates real clinical workflows more closely by separating general intake, specialist input, and final moderation.
The approach works for both textual medical questions and visual inputs such as images.
It supplies a reusable template for future multi-agent LMM research in healthcare.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the router generalizes beyond the training distribution, similar dynamic selection could reduce reliance on ever-larger single models in other specialized domains.
Adding patient history or lab results as extra router inputs might further improve selection quality without changing the overall architecture.
The same routing idea could be tested in non-medical settings that require choosing among expert agents, such as technical support or legal review.

Load-bearing premise

An RL router trained on the available medical datasets will continue to choose suitable specialists for the full variety of real-world conditions without overfitting or needing repeated retraining.

What would settle it

Run MedRoute on a held-out collection of rare or previously unseen medical conditions and measure whether its accuracy advantage over static baselines disappears or reverses.

Figures

Figures reproduced from arXiv: 2604.06180 by Ashmal Vayani, Joseph Fioresi, Mubarak Shah, Parth Parag Kulkarni, Song Wang.

**Figure 1.** Figure 1: Specialist Consultation in Correct Order leads to More Accurate Diagnosis. Image shows a knee X-ray of an Osteomyelitis patient. Previous works suffer from lack of coordination between Specialist Agents. Our framework ensures use of prior knowledge for next specialist allocation with a dynamic router resulting in better diagnosis. 2025c; Campos et al., 2025). This has also given rise to specialized models… view at source ↗

**Figure 2.** Figure 2: Schematic Illustration of our flexible multi-agent framework The question along with the image is input into the General Practitioner(GP) Agent which is a router for specialist allocation. The GP inputs all potential agents from the specialist pool and allocates the first specialist based on the question (Neurologist in this case). The Neurologist agent gives its diagnosis which is passed into the GP as hi… view at source ↗

**Figure 3.** Figure 3: Schematic Illustration of Router The image (if multimodal), is input into the image captioner agent to capture detail for specialist allocation. This caption along with the question is input into the T E(.) to generate a combined task embedding. This is simultaneously used to generate Specialist Vector and Specialist History Vector representing the next specialist and previously selected specialists respec… view at source ↗

**Figure 4.** Figure 4: Qualitative Example with our Framework The input image shows axial CT scans of the heart and an ECG, which the question asks about. Out of the variety of specialists in the pool, the GP agent first selects a Cardiologist, which analyzes the ECG to give its diagnosis. Based on this diagnosis the GP consults a Thoracic Surgeon that gives a diagnosis consistent with the previous agent. Finally the GP routes t… view at source ↗

**Figure 5.** Figure 5: Prompt for Specialist Recommendation Generation Each specialist has a unique set of roles and responsibilities. A few examples of the roles and their responsibilities are shown in [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Responsibilities for 5 different specialists Moderator Call 00 You are the top decision-maker and are good at analyzing and summarizing other people's opinions, finding errors, and giving final answers.. I will ask you a question. I will also give you 4 answers enumerated as A, B, C, and D. Only one answer out of the offered 4 is correct. You must choose the correct answer to the question. Your response mu… view at source ↗

**Figure 7.** Figure 7: Prompt for Moderator Call C. Prompts used for Evaluation Evaluation of the model performance can be done in 2 ways depending on the nature of the question. • Case 1: Multiple Choice Questions: if the questions are multiple choice, evaluation needs to compare the chosen option with the index of the ground truth option, if the model outputs in that specific format. More often than not, the model does not adh… view at source ↗

**Figure 8.** Figure 8: Prompt for Evaluation of MCQ Type Questions D. Dataset Details D.1. DeepLesion The DeepLesion(Yan et al., 2018) dataset is a large-scale, clinically derived collection of CT scans designed for automated lesion detection and analysis. It contains over 32,000 axial CT slices from more than 4,500 patients, encompassing approximately 32,735 lesions across a variety of organs, including the lung, liver, bone, s… view at source ↗

**Figure 9.** Figure 9: Prompt for Evaluation of Open-ended Type Questions To generate data fit for our framework, we formulate each sample into a multiple choice question. The options are the aforementioned coarse labels, while the question is randomly chosen out of these options: • What category best describes the lesion shown in this image? • Which anatomical region does the lesion in this image most likely originate from? • B… view at source ↗

**Figure 10.** Figure 10: Example QA pairs from Vision Datasets PathVQA and PMC-VQA E. Additional Results on a dataset focusing on a single anatomical feature NIH Chestxray8(Wang et al., 2017) dataset is a large publicly available dataset of frontal-view chest X-ray images designed to facilitate automated thoracic disease detection and classification. It contains over 112,000 X-ray images from more than 30,000 unique patients, ann… view at source ↗

read the original abstract

Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to capability of these models in providing precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuitable under the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of various specialists, cannot be adapted to the changing practical scenario. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises of a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming the state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF-CRCV/MedRoute/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MedRoute adds an RL router to pick specialist LMM agents for medical cases and reports accuracy gains on tested datasets, but the training and generalization details stay thin.

read the letter

MedRoute trains a reinforcement learning router to dynamically choose among specialist LMM agents for medical diagnosis queries, then relies on a moderator for the final output. The evaluations show improved accuracy over baselines on text and image medical datasets. The new element is this RL-driven routing mechanism in a multi-agent medical setup, which moves beyond static specialist selection to something more adaptive. The framework includes a general practitioner agent for the routing decision and aims to reflect real clinical processes. Making the code and models available on GitHub is a solid step that lets others reproduce the router training. The paper handles the agent collaboration reasonably well by breaking down the diagnosis into specialist contributions with dynamic selection. Testing across both text and image inputs demonstrates the approach works for varied data types common in medicine. Where it falls short is in the supporting details for the RL component. The description does not cover the reward design, state representation for the router, or any out-of-distribution tests to check generalization beyond the training datasets. This leaves open whether the accuracy improvements stem from effective routing or from fitting to the specific evaluation sets. The baselines used and any statistical analysis are also not specified, making it tough to gauge how meaningful the gains are. This is the kind of paper that would interest researchers in multi-agent AI for healthcare. A reader focused on building routing systems for diagnostic models could extract useful ideas from the architecture and the released implementation. I recommend sending it to peer review. The idea is practical and the code availability means the work can be examined closely by referees who can point out where more validation is needed.

Referee Report

3 major / 2 minor

Summary. The paper proposes MedRoute, a dynamic multi-agent framework for medical diagnosis using Large Multimodal Models (LMMs). It consists of specialist LMM agents, a General Practitioner agent with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final diagnosis. The authors claim that this setup closely mirrors clinical workflows and achieves improved diagnostic accuracy over state-of-the-art baselines on text and image-based medical datasets, with code and models released.

Significance. If the empirical claims hold under rigorous validation, the work could advance multi-agent LMM systems by demonstrating a practical, adaptive routing mechanism that emulates real clinical specialist collaboration. The open release of code supports reproducibility and future extensions in medical AI.

major comments (3)

[§4] §4 (Experiments): No description is given of the RL training procedure, including the reward function, state representation, action space, or optimization algorithm used for the router. Without these, it is impossible to evaluate whether the reported accuracy gains stem from the dynamic routing or from other factors.
[§4.2] §4.2 (Evaluation): The manuscript provides no details on the specific baselines, dataset splits, statistical significance tests, or ablation studies isolating the RL router's contribution. This undermines the claim of outperforming SOTA methods.
[§4.3] §4.3 (Generalization): There are no out-of-distribution or long-tail condition experiments testing the router on unseen medical scenarios. This is load-bearing for the central claim that the framework reliably handles real-world variability.

minor comments (2)

[Abstract] The abstract states 'extensive evaluations' but reports no quantitative accuracy numbers or improvement margins, which would strengthen the summary of results.
[§3] Notation for the router policy and moderator decision process could be formalized with equations in §3 to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to provide the requested details on the RL procedure, evaluation protocol, and generalization experiments.

read point-by-point responses

Referee: [§4] §4 (Experiments): No description is given of the RL training procedure, including the reward function, state representation, action space, or optimization algorithm used for the router. Without these, it is impossible to evaluate whether the reported accuracy gains stem from the dynamic routing or from other factors.

Authors: We agree that the RL training details were insufficiently described. In the revised manuscript we will add a new subsection in §4 that explicitly defines the state representation (query embedding concatenated with patient history), action space (discrete selection over the specialist agents), reward function (weighted combination of final diagnostic accuracy and number of consultations), and optimization algorithm (PPO with the specific hyperparameters used). revision: yes
Referee: [§4.2] §4.2 (Evaluation): The manuscript provides no details on the specific baselines, dataset splits, statistical significance tests, or ablation studies isolating the RL router's contribution. This undermines the claim of outperforming SOTA methods.

Authors: We acknowledge the omission. The revision will expand §4.2 to list all baselines (static routing, single LMM, and prior multi-agent systems), report the exact train/validation/test splits for each dataset, include statistical significance results (paired t-tests and McNemar’s test with p-values), and add ablation experiments that disable the RL router while keeping all other components fixed. revision: yes
Referee: [§4.3] §4.3 (Generalization): There are no out-of-distribution or long-tail condition experiments testing the router on unseen medical scenarios. This is load-bearing for the central claim that the framework reliably handles real-world variability.

Authors: We agree that explicit OOD and long-tail tests strengthen the central claim. The revised §4.3 will include new experiments on held-out disease categories and distribution-shifted image/text data, with quantitative results and analysis of router behavior under these conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy claims rest on dataset evaluations, not self-referential derivations

full rationale

The paper presents MedRoute as an RL-based multi-agent framework and supports its central claim of improved diagnostic accuracy solely through 'extensive evaluations on text and image-based medical datasets' that 'outperform the state-of-the-art baselines.' No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The RL router is described as trained for dynamic selection, but the accuracy result is reported as an external experimental outcome rather than a quantity that reduces to the training inputs by construction. This is the expected non-finding for an applied systems paper whose load-bearing evidence is benchmark performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework relies on standard RL and LMM components assumed from prior literature.

pith-pipeline@v0.9.0 · 5548 in / 1032 out tokens · 28501 ms · 2026-05-16T06:41:40.736842+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We train this router using reinforcement learning using the validity of final diagnosis as our reward... LPG(θ) =−1/G ∑ log πθ(sp′t|ipt)·At

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Medalpaca–an open-source collection of medical conversational ai models and training data

PMLR, 2016. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025. Han, T., Adams, L. C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., L¨oser, A., Truhn, D., and Bressem, K. K. Medalpaca–an open-so...

work page arXiv 2016
[2]

Step 1: Collect all the questions in the dataset

work page
[3]

Step 2: Prompt GPT-4.1-mini(Achiam et al., 2023) to recommend 3-7 specialists which can solve the given question

work page 2023
[4]

Make a list of all specialists recommended for samples of a dataset

work page
[5]

Count the number of data points a particular specialist is called for

work page
[6]

For general purpose QA/VQA datasets k is between 50-60

Take the top-k specialists to form the pool The number of specialists k depends on the dataset. For general purpose QA/VQA datasets k is between 50-60. The prompt for generating specialist recommendations is shown in Fig. 5. 00Specialist Recommendation Generation Output ONLY valid JSON. Do not include markdown, explanations, or extra text Rules: - special...

work page
[7]

Option 2

Extract the final chosen option from the predicted answer. - Prefer an explicit option index/label if stated (e.g., "Option 2", "2", "(B)", "B"). - If the correct option is predicted, and the reasoning is wrong, still mark it correct. - If multiple options are mentioned, use the final committed answer

work page
[8]

If no explicit option is stated, infer the implied choice from the conclusion

work page
[9]

Compare the extracted choice to the ground-truth choice

work page
[10]

incorrect

Be conservative: if ambiguous or non-committal, mark "incorrect". Output format (STRICT): Return ONLY a valid JSON object with exactly these keys: - "result": "correct" or "incorrect" - "reason": a brief explanation (one sentence) Do not output any other text. Figure 8.Prompt for Evaluation of MCQ Type Questions D. Dataset Details D.1. DeepLesion The Deep...

work page 2018
[11]

Inputs: Question: {question} Ground-truth answer: {gt_answer} Predicted answer: {pred_ans} Evaluation guidelines:

pelvis 15 MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis 00Open-ended Evaluation Determine whether the model's predicted answer matches the ground-truth answer. Inputs: Question: {question} Ground-truth answer: {gt_answer} Predicted answer: {pred_ans} Evaluation guidelines:

work page
[12]

Compare the predicted answer to the ground-truth answer semantically

work page
[13]

Minor wording differences are acceptable if meaning is equivalent

work page
[14]

Extra correct information is acceptable

work page
[15]

incorrect

Missing key facts, incorrect claims, or ambiguity should be marked "incorrect"

work page
[16]

incorrect

Be conservative: if unsure, mark "incorrect". Output format (STRICT): Return ONLY a valid JSON object with exactly these keys: - "result": "correct" or "incorrect" - "reason": a brief explanation (one sentence) Do not output any other text. Figure 9.Prompt for Evaluation of Open-ended Type Questions To generate data fit for our framework, we formulate eac...

work page 2020

[1] [1]

Medalpaca–an open-source collection of medical conversational ai models and training data

PMLR, 2016. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025. Han, T., Adams, L. C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., L¨oser, A., Truhn, D., and Bressem, K. K. Medalpaca–an open-so...

work page arXiv 2016

[2] [2]

Step 1: Collect all the questions in the dataset

work page

[3] [3]

Step 2: Prompt GPT-4.1-mini(Achiam et al., 2023) to recommend 3-7 specialists which can solve the given question

work page 2023

[4] [4]

Make a list of all specialists recommended for samples of a dataset

work page

[5] [5]

Count the number of data points a particular specialist is called for

work page

[6] [6]

For general purpose QA/VQA datasets k is between 50-60

Take the top-k specialists to form the pool The number of specialists k depends on the dataset. For general purpose QA/VQA datasets k is between 50-60. The prompt for generating specialist recommendations is shown in Fig. 5. 00Specialist Recommendation Generation Output ONLY valid JSON. Do not include markdown, explanations, or extra text Rules: - special...

work page

[7] [7]

Option 2

Extract the final chosen option from the predicted answer. - Prefer an explicit option index/label if stated (e.g., "Option 2", "2", "(B)", "B"). - If the correct option is predicted, and the reasoning is wrong, still mark it correct. - If multiple options are mentioned, use the final committed answer

work page

[8] [8]

If no explicit option is stated, infer the implied choice from the conclusion

work page

[9] [9]

Compare the extracted choice to the ground-truth choice

work page

[10] [10]

incorrect

Be conservative: if ambiguous or non-committal, mark "incorrect". Output format (STRICT): Return ONLY a valid JSON object with exactly these keys: - "result": "correct" or "incorrect" - "reason": a brief explanation (one sentence) Do not output any other text. Figure 8.Prompt for Evaluation of MCQ Type Questions D. Dataset Details D.1. DeepLesion The Deep...

work page 2018

[11] [11]

Inputs: Question: {question} Ground-truth answer: {gt_answer} Predicted answer: {pred_ans} Evaluation guidelines:

pelvis 15 MedRoute: RL-Based Dynamic Specialist Routing in Multi-Agent Medical Diagnosis 00Open-ended Evaluation Determine whether the model's predicted answer matches the ground-truth answer. Inputs: Question: {question} Ground-truth answer: {gt_answer} Predicted answer: {pred_ans} Evaluation guidelines:

work page

[12] [12]

Compare the predicted answer to the ground-truth answer semantically

work page

[13] [13]

Minor wording differences are acceptable if meaning is equivalent

work page

[14] [14]

Extra correct information is acceptable

work page

[15] [15]

incorrect

Missing key facts, incorrect claims, or ambiguity should be marked "incorrect"

work page

[16] [16]

incorrect

Be conservative: if unsure, mark "incorrect". Output format (STRICT): Return ONLY a valid JSON object with exactly these keys: - "result": "correct" or "incorrect" - "reason": a brief explanation (one sentence) Do not output any other text. Figure 9.Prompt for Evaluation of Open-ended Type Questions To generate data fit for our framework, we formulate eac...

work page 2020