ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Akash Ghosh; Arijit Roy; Sriparna Saha; Subhadip Baidya; Tanmoy Kanti Halder

arxiv: 2606.13572 · v1 · pith:YUNSZPLZnew · submitted 2026-06-11 · 💻 cs.CL · cs.AI

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Tanmoy Kanti Halder , Akash Ghosh , Subhadip Baidya , Arijit Roy , Sriparna Saha This is my paper

Pith reviewed 2026-06-27 06:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multimodal medical reasoningIndic languagesmulti-agent frameworkactor-criticmultilingual datasetmedical QAlow-resource languagestool grounding

0 comments

The pith

An actor-critic multi-agent framework paired with a new Indic-language medical dataset raises reasoning accuracy on multimodal queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs ArogyaBodha, a large multilingual multimodal medical QA dataset drawn from eight sources that spans seven Indic languages plus English, 31 body systems, six imaging modalities, and 21 clinical domains. It introduces ArogyaSutra, an actor-critic multi-agent system that adds tool grounding and dual-memory mechanisms to produce step-wise reasoning decisions, then distills the resulting trajectories. Experiments report higher accuracy on medical reasoning tasks across all tested Indic languages when the dataset and framework are used together. Ablation results attribute measurable gains to each added component. The work focuses on the gap left by English-centric models for healthcare queries that arrive with images in rural or low-resource language settings.

Core claim

ArogyaBodha is a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. ArogyaSutra is an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making and uses stored actor-critic simulation trajectories for distillation. Experiments show that the dataset and framework together improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component.

What carries the argument

ArogyaSutra, the actor-critic multi-agent framework that performs tool-grounded, memory-augmented step-wise reasoning and distills trajectories from its own simulations.

If this is right

Accuracy on medical reasoning tasks rises for every one of the seven Indic languages tested.
Removing any single component of the actor-critic setup or the dataset produces a measurable drop in performance.
The framework handles multimodal inputs such as medical images alongside text queries.
Distillation from stored simulation trajectories transfers the step-wise reasoning behavior into the final model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same actor-critic plus distillation pattern could be applied to other specialized low-resource language tasks outside medicine.
Deployment in actual clinics would require separate checks on how the system handles noisy or incomplete real-world queries.
The approach offers one route for adapting general-purpose multimodal models to new languages without full retraining from scratch.

Load-bearing premise

The measured accuracy gains come from the actor-critic multi-agent design and dataset structure rather than from hidden implementation choices or non-representative source material.

What would settle it

Re-running the full evaluation pipeline on an independently collected set of real Indic-language patient queries with images and finding no accuracy improvement over baseline MLLMs.

Figures

Figures reproduced from arXiv: 2606.13572 by Akash Ghosh, Arijit Roy, Sriparna Saha, Subhadip Baidya, Tanmoy Kanti Halder.

**Figure 1.** Figure 1: Overview of the ArogyaSutra framework. ArogyaSutra employs an actor–critic architecture enhanced with tool-based image grounding and adaptive code-switching. The Actor first processes the input prompt and identifies the need for visual grounding, invoking appropriate tool agents to extract clinically relevant information from medical images before generating an answer with its associated reasoning. This o… view at source ↗

**Figure 2.** Figure 2: Overview of qualitative comparisons between [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ ArogyaSutra/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New Indic medical multimodal dataset is the real contribution here; the actor-critic multi-agent claims rest on unshown baselines and controls.

read the letter

The main thing to know is that this paper ships a new dataset ArogyaBodha built from eight sources across seven Indic languages plus English, covering medical images and questions in 31 body systems. That fills a genuine gap for low-resource medical reasoning. The ArogyaSutra framework adds an actor-critic multi-agent setup with tool use, dual memory, and distillation from simulation trajectories.

What works is the practical focus. Rural India has real demand for models that handle native-language queries with scans or reports, and the authors correctly flag that English-centric MLLMs fall short. Releasing both code and data is the right move and lets others test the claims directly.

The soft spot is the experimental section. The abstract says ablations confirm each component improves accuracy, but gives no numbers, no single-agent baseline on the same data, no held-out real-patient test set, and no check for translation artifacts or source overlap. Without those, it is hard to tell whether gains come from the new data or from the actor-critic machinery. The stress-test note is right on this point.

This is for groups already working on multilingual medical AI or Indic-language applications. A reader who needs the dataset or wants to extend the multi-agent pattern will get value; someone looking for a clean methodological advance will find the evidence thin.

It deserves a serious referee. The dataset alone is worth review, and the framework is a reasonable engineering response to the setting. The authors should be asked to add the missing controls and metrics before acceptance, but the work is coherent enough to go out for comments.

Referee Report

3 major / 1 minor

Summary. The paper introduces ArogyaBodha, a large-scale multilingual multimodal medical QA dataset built from eight heterogeneous sources spanning 31 body systems, six modalities, and seven Indic languages plus English, and proposes ArogyaSutra, an actor-critic multi-agent framework that combines tool grounding, dual-memory mechanisms, and distillation from simulation trajectories. It claims that the dataset and framework together improve medical reasoning accuracy across Indic languages, with ablations confirming the value of each component.

Significance. If the experimental isolation of the multi-agent contributions holds, the work would supply both a needed resource for low-resource medical reasoning and evidence for actor-critic designs in specialized multimodal settings; the public release of code and dataset is a clear positive.

major comments (3)

[§5 (Experiments)] §5 (Experiments): The central claim that accuracy gains are attributable to the actor-critic multi-agent structure, tool grounding, and dual-memory (rather than properties of the new ArogyaBodha dataset) requires explicit single-agent MLLM baselines fine-tuned on the identical training split; without these controls the attribution remains unverified.
[§3 (Dataset)] §3 (Dataset): The construction from eight heterogeneous sources does not specify whether test queries are drawn from held-out real patient data or from the same sources, nor does it report controls for translation artifacts or domain overlap; this directly affects whether observed gains can be credited to the framework.
[§5 (Experiments)] §5 (Experiments) and abstract: No numerical metrics, baseline scores, statistical tests, or error analysis are referenced, so the assertion that “ablations validate the contribution of each component” cannot be assessed for magnitude or robustness.

minor comments (1)

[Abstract] The dataset link in the abstract is truncated ('https://iitp-cse.github.io/ ArogyaSutra/').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental controls, dataset transparency, and the need for quantitative rigor. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [§5 (Experiments)] The central claim that accuracy gains are attributable to the actor-critic multi-agent structure, tool grounding, and dual-memory (rather than properties of the new ArogyaBodha dataset) requires explicit single-agent MLLM baselines fine-tuned on the identical training split; without these controls the attribution remains unverified.

Authors: We agree that the current experimental design does not fully isolate the multi-agent contributions. The manuscript compares ArogyaSutra primarily against base MLLMs without fine-tuning on ArogyaBodha, which leaves open the possibility that gains stem from the dataset rather than the framework. In the revised manuscript we will add single-agent MLLM baselines that are fine-tuned on the identical ArogyaBodha training split. These controls will allow direct attribution of improvements to the actor-critic structure, tool grounding, and dual-memory components. revision: yes
Referee: [§3 (Dataset)] The construction from eight heterogeneous sources does not specify whether test queries are drawn from held-out real patient data or from the same sources, nor does it report controls for translation artifacts or domain overlap; this directly affects whether observed gains can be credited to the framework.

Authors: We will revise §3 to explicitly describe the train/test split construction. The test queries are drawn from held-out portions of the source distributions with no patient-case overlap where identifiable. We will also add details on translation artifact controls (expert review and back-translation verification) and domain-overlap mitigation steps. These clarifications will strengthen the claim that performance gains can be credited to the ArogyaSutra framework. revision: yes
Referee: [§5 (Experiments)] and abstract: No numerical metrics, baseline scores, statistical tests, or error analysis are referenced, so the assertion that “ablations validate the contribution of each component” cannot be assessed for magnitude or robustness.

Authors: We acknowledge that the current manuscript version does not include explicit numerical metrics, baseline scores, statistical tests, or error analysis to support the ablation claims. In the revision we will expand §5 with concrete accuracy numbers for all baselines and ablations, add statistical significance testing, and include a dedicated error analysis subsection. The abstract will be updated to reference key quantitative results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on newly constructed dataset and framework without reducing to fitted inputs or self-citations

full rationale

The paper introduces ArogyaBodha (new dataset from eight sources) and ArogyaSutra (new actor-critic multi-agent framework with tool grounding and dual-memory), then reports experimental accuracy gains with ablations. No equations, fitted parameters, or derivation chain appear in the provided text. Claims do not reduce any prediction to its own inputs by construction, nor rely on load-bearing self-citations for uniqueness or ansatzes. The central attribution of gains to the framework is an empirical question (addressed via ablations) rather than a definitional or self-referential reduction. This is a standard self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are explicitly described in the abstract.

pith-pipeline@v0.9.1-grok · 5766 in / 1070 out tokens · 16785 ms · 2026-06-27T06:54:09.354288+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references

[1]

M3retrieve: Benchmarking multimodal re- trieval for medicine

[Acharyaet al., 2025 ] Arkadeep Acharya, Akash Ghosh, Pradeepika Verma, Kitsuchart Pasupa, Sriparna Saha, and Priti Singh. M3retrieve: Benchmarking multimodal re- trieval for medicine. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Process- ing, pages 15274–15287,

2025
[2]

InternVL: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks

[Chenet al., 2024 ] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qing- long Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Patt...

2024
[3]

Dorfner, Amin Dada, Felix Busch, Marcus R

[Dorfneret al., 2024 ] Felix J. Dorfner, Amin Dada, Felix Busch, Marcus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, and Keno K. Bressem. Biomedical large languages models seem not to be superior to generalist models on unseen medical data.CoRR, abs/2408.13833,

arXiv 2024
[4]

Carepilot: A multi-agent framework for long-horizon computer task automation in healthcare

[Ghoshet al., 2026 ] Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuy- ing Chen, and Salman Khan. Carepilot: A multi-agent framework for long-horizon computer task automation in healthcare. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9695–9705,

2026
[5]

DeepSeek-R1 incen- tivizes reasoning in LLMs through reinforcement learning

[Guoet al., 2025 ] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incen- tivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638,

2025
[6]

Medical multimodal large language models: A systematic review.Intell

[Huet al., 2025 ] Yuan Hu, Chenhan Xu, Bo Lin, Weibin Yang, and Yuan Yan Tang. Medical multimodal large language models: A systematic review.Intell. Oncol., 1(4):308–325, October

2025
[7]

Directional reasoning injection for fine-tuning MLLMs,

[Huanget al., 2025 ] Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chen- liang Xu, Emad Barsoum, and Zicheng Liu. Directional reasoning injection for fine-tuning MLLMs,

2025
[8]

MedVLThinker: Sim- ple baselines for multimodal medical reasoning

[Huanget al., 2026 ] Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. MedVLThinker: Sim- ple baselines for multimodal medical reasoning. InPro- ceedings of the Fifth Machine Learning for Health Sym- posium, volume 297 ofProceedings of Machine Learning Research, pages 384–398. PMLR, 13–14 Dec

2026
[9]

Med-MoE: Mixture of domain-specific experts for lightweight medical vision- language models

[Jianget al., 2024b ] Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, and Zuozhu Liu. Med-MoE: Mixture of domain-specific experts for lightweight medical vision- language models. InFindings of the Association for Com- putational Linguistics: EMNLP 2024, pages 3843–3860, Miami, Florida, USA, November

2024
[10]

[Jianget al., 2025 ] Yixing Jiang, Kameron C Black, Glo- ria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen

Association for Computational Linguistics. [Jianget al., 2025 ] Yixing Jiang, Kameron C Black, Glo- ria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. Medagentbench: a virtual ehr en- vironment to benchmark medical llm agents.Nejm Ai, 2(9):AIdbp2500144,

2025
[11]

BioMistral: A collection of open-source pretrained large language models for medical domains

[Labraket al., 2024 ] Yanis Labrak, Adrien Bazoge, Em- manuel Morin, Pierre-Antoine Gourraud, Mickael Rou- vier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains. InFindings of the Association for Computational Linguistics: ACL 2024, pages 5848–5864, Bangkok, Thai- land, August

2024
[12]

[Laiet al., 2025 ] Yuxiang Lai, Jike Zhong, Ming Li, Shi- tian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang

Association for Computational Lin- guistics. [Laiet al., 2025 ] Yuxiang Lai, Jike Zhong, Ming Li, Shi- tian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Jour- nals & Magazine,

2025
[13]

Llava- med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

[Liet al., 2023 ] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava- med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

2023
[14]

MMedAgent: Learning to use medical tools with multi-modal agent

[Liet al., 2024 ] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, and Yixin Wang. MMedAgent: Learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, Stroudsburg, PA, USA,

2024
[15]

Association for Computational Linguistics. [Liet al., 2025 ] Feiyang Li, Yingjian Chen, Haoran Liu, Rui Yang, Han Yuan, Yuang Jiang, Tianxiao Li, Edison Mar- rese Taylor, Hossein Rouhizadeh, Yusuke Iwasawa, Dou- glas Teodoro, Yutaka Matsuo, and Irene Li. MKG-Rank: Enhancing large language models with knowledge graph for multilingual medical question answering,

2025
[16]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

[Liuet al., 2023 ] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

2023
[17]

Drishtikon: A multi- modal multilingual benchmark for testing language mod- els’ understanding on indian culture

[Majiet al., 2025 ] Arijit Maji, Raghvendra Kumar, Akash Ghosh, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nis- hant Mishra, Sriparna Saha, et al. Drishtikon: A multi- modal multilingual benchmark for testing language mod- els’ understanding on indian culture. InProceedings of the 2025 Conference on Empirical Methods in Natural Lan- guage Processing, page...

2025
[18]

The future of MLLM prompting is adaptive: A comprehensive ex- perimental evaluation of prompt engineering methods for robust multimodal performance,

[Mohantyet al., 2025 ] Anwesha Mohanty, Venkatesh Bal- avadhani Parthasarathy, and Arsalan Shahid. The future of MLLM prompting is adaptive: A comprehensive ex- perimental evaluation of prompt engineering methods for robust multimodal performance,

2025
[19]

Scaling neural machine translation to 200 languages.Nature, 630(8018):841–846,

[NLLB Team, 2024] NLLB Team. Scaling neural machine translation to 200 languages.Nature, 630(8018):841–846,

2024
[20]

Cure-med: Curriculum-informed reinforcement learning for multilingual medical reasoning,

[Onyameet al., 2026 ] Eric Onyame, Akash Ghosh, Sub- hadip Baidya, Sriparna Saha, Xiuying Chen, and Chirag Agarwal. Cure-med: Curriculum-informed reinforcement learning for multilingual medical reasoning,

2026
[21]

MedVLM-R1: Incentiviz- ing medical reasoning capability of vision-language mod- els (VLMs) via reinforcement learning

[Panet al., 2025 ] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. MedVLM-R1: Incentiviz- ing medical reasoning capability of vision-language mod- els (VLMs) via reinforcement learning. InMedical Image Computing and Computer-Assisted Intervention – MIC- CAI

2025
[22]

A survey of multilingual large language models.Patterns, 6(1),

[Qinet al., 2025 ] Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S Yu. A survey of multilingual large language models.Patterns, 6(1),

2025
[23]

Towards building multilingual language model for medicine.Nat

[Qiuet al., 2024 ] Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yan- feng Wang, and Weidi Xie. Towards building multilingual language model for medicine.Nat. Commun., 15(1):8384, September

2024
[24]

A multilingual multimodal medical examina- tion dataset for visual question answering in healthcare

[Riccioet al., 2025 ] Giuseppe Riccio, Antonio Romano, Mariano Barone, Gian Marco Orlando, Diego Russo, Marco Postiglione, Valerio La Gatta, and Vincenzo Moscato. A multilingual multimodal medical examina- tion dataset for visual question answering in healthcare. In 2025 IEEE 38th International Symposium on Computer- Based Medical Systems (CBMS), pages 435–440,

2025
[25]

AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,

[Schmidgallet al., 2024 ] Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,

2024
[26]

MedGemma technical report,

[Sellergrenet al., 2025 ] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C´ıan Hughes, Charles Lau, et al. MedGemma technical report,

2025
[27]

Chiron-o1: Igniting multimodal large language models towards generalizable medical rea- soning via mentor-intern collaborative search

[Sunet al., 2025 ] Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, and Xiaosong Wang. Chiron-o1: Igniting multimodal large language models towards generalizable medical rea- soning via mentor-intern collaborative search. InAdvances in Neural Information Processing Systems,

2025
[28]

Vi- raasat: Traversing novel paths for indian cultural reason- ing,

[Suranaet al., 2026 ] Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, and Amit Sheth. Vi- raasat: Traversing novel paths for indian cultural reason- ing,

2026
[29]

Medagents: Large language models as collaborators for zero-shot medical reasoning

[Tanget al., 2024 ] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InFindings of the Association for Computational Linguis- tics: ACL 2024, pages 599–621,

2024
[30]

Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,

[Wanget al., 2024 ] Peng Wang, Shuai Bai, Sinan Tan, Shi- jie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,

2024
[31]

MMedAgent-RL: Opti- mizing multi-agent collaboration for multimodal medical reasoning

[Xiaet al., 2026 ] Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Shu- jie Liu, Yan Lu, and Huaxiu Yao. MMedAgent-RL: Opti- mizing multi-agent collaboration for multimodal medical reasoning. InInternational Conference on Learning Rep- resentations,

2026
[32]

Mul- timodal large language models for medical image diag- nosis: Challenges and opportunities.J

[Zhanget al., 2025 ] Andrew Zhang, Eric Zhao, Ruirui Wang, Xiuqi Zhang, Justin Wang, and Ethan Chen. Mul- timodal large language models for medical image diag- nosis: Challenges and opportunities.J. Biomed. Inform., 169(104895):104895, September

2025
[33]

Reinforced visual per- ception with tools, 2025

[Zhouet al., 2025b ] Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual per- ception with tools, 2025

2025

[1] [1]

M3retrieve: Benchmarking multimodal re- trieval for medicine

[Acharyaet al., 2025 ] Arkadeep Acharya, Akash Ghosh, Pradeepika Verma, Kitsuchart Pasupa, Sriparna Saha, and Priti Singh. M3retrieve: Benchmarking multimodal re- trieval for medicine. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Process- ing, pages 15274–15287,

2025

[2] [2]

InternVL: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks

[Chenet al., 2024 ] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qing- long Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Patt...

2024

[3] [3]

Dorfner, Amin Dada, Felix Busch, Marcus R

[Dorfneret al., 2024 ] Felix J. Dorfner, Amin Dada, Felix Busch, Marcus R. Makowski, Tianyu Han, Daniel Truhn, Jens Kleesiek, Madhumita Sushil, Jacqueline Lammert, Lisa C. Adams, and Keno K. Bressem. Biomedical large languages models seem not to be superior to generalist models on unseen medical data.CoRR, abs/2408.13833,

arXiv 2024

[4] [4]

Carepilot: A multi-agent framework for long-horizon computer task automation in healthcare

[Ghoshet al., 2026 ] Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuy- ing Chen, and Salman Khan. Carepilot: A multi-agent framework for long-horizon computer task automation in healthcare. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9695–9705,

2026

[5] [5]

DeepSeek-R1 incen- tivizes reasoning in LLMs through reinforcement learning

[Guoet al., 2025 ] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incen- tivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638,

2025

[6] [6]

Medical multimodal large language models: A systematic review.Intell

[Huet al., 2025 ] Yuan Hu, Chenhan Xu, Bo Lin, Weibin Yang, and Yuan Yan Tang. Medical multimodal large language models: A systematic review.Intell. Oncol., 1(4):308–325, October

2025

[7] [7]

Directional reasoning injection for fine-tuning MLLMs,

[Huanget al., 2025 ] Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chen- liang Xu, Emad Barsoum, and Zicheng Liu. Directional reasoning injection for fine-tuning MLLMs,

2025

[8] [8]

MedVLThinker: Sim- ple baselines for multimodal medical reasoning

[Huanget al., 2026 ] Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. MedVLThinker: Sim- ple baselines for multimodal medical reasoning. InPro- ceedings of the Fifth Machine Learning for Health Sym- posium, volume 297 ofProceedings of Machine Learning Research, pages 384–398. PMLR, 13–14 Dec

2026

[9] [9]

Med-MoE: Mixture of domain-specific experts for lightweight medical vision- language models

[Jianget al., 2024b ] Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, and Zuozhu Liu. Med-MoE: Mixture of domain-specific experts for lightweight medical vision- language models. InFindings of the Association for Com- putational Linguistics: EMNLP 2024, pages 3843–3860, Miami, Florida, USA, November

2024

[10] [10]

[Jianget al., 2025 ] Yixing Jiang, Kameron C Black, Glo- ria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen

Association for Computational Linguistics. [Jianget al., 2025 ] Yixing Jiang, Kameron C Black, Glo- ria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. Medagentbench: a virtual ehr en- vironment to benchmark medical llm agents.Nejm Ai, 2(9):AIdbp2500144,

2025

[11] [11]

BioMistral: A collection of open-source pretrained large language models for medical domains

[Labraket al., 2024 ] Yanis Labrak, Adrien Bazoge, Em- manuel Morin, Pierre-Antoine Gourraud, Mickael Rou- vier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains. InFindings of the Association for Computational Linguistics: ACL 2024, pages 5848–5864, Bangkok, Thai- land, August

2024

[12] [12]

[Laiet al., 2025 ] Yuxiang Lai, Jike Zhong, Ming Li, Shi- tian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang

Association for Computational Lin- guistics. [Laiet al., 2025 ] Yuxiang Lai, Jike Zhong, Ming Li, Shi- tian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Jour- nals & Magazine,

2025

[13] [13]

Llava- med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

[Liet al., 2023 ] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava- med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,

2023

[14] [14]

MMedAgent: Learning to use medical tools with multi-modal agent

[Liet al., 2024 ] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, and Yixin Wang. MMedAgent: Learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, Stroudsburg, PA, USA,

2024

[15] [15]

Association for Computational Linguistics. [Liet al., 2025 ] Feiyang Li, Yingjian Chen, Haoran Liu, Rui Yang, Han Yuan, Yuang Jiang, Tianxiao Li, Edison Mar- rese Taylor, Hossein Rouhizadeh, Yusuke Iwasawa, Dou- glas Teodoro, Yutaka Matsuo, and Irene Li. MKG-Rank: Enhancing large language models with knowledge graph for multilingual medical question answering,

2025

[16] [16]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

[Liuet al., 2023 ] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916,

2023

[17] [17]

Drishtikon: A multi- modal multilingual benchmark for testing language mod- els’ understanding on indian culture

[Majiet al., 2025 ] Arijit Maji, Raghvendra Kumar, Akash Ghosh, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nis- hant Mishra, Sriparna Saha, et al. Drishtikon: A multi- modal multilingual benchmark for testing language mod- els’ understanding on indian culture. InProceedings of the 2025 Conference on Empirical Methods in Natural Lan- guage Processing, page...

2025

[18] [18]

The future of MLLM prompting is adaptive: A comprehensive ex- perimental evaluation of prompt engineering methods for robust multimodal performance,

[Mohantyet al., 2025 ] Anwesha Mohanty, Venkatesh Bal- avadhani Parthasarathy, and Arsalan Shahid. The future of MLLM prompting is adaptive: A comprehensive ex- perimental evaluation of prompt engineering methods for robust multimodal performance,

2025

[19] [19]

Scaling neural machine translation to 200 languages.Nature, 630(8018):841–846,

[NLLB Team, 2024] NLLB Team. Scaling neural machine translation to 200 languages.Nature, 630(8018):841–846,

2024

[20] [20]

Cure-med: Curriculum-informed reinforcement learning for multilingual medical reasoning,

[Onyameet al., 2026 ] Eric Onyame, Akash Ghosh, Sub- hadip Baidya, Sriparna Saha, Xiuying Chen, and Chirag Agarwal. Cure-med: Curriculum-informed reinforcement learning for multilingual medical reasoning,

2026

[21] [21]

MedVLM-R1: Incentiviz- ing medical reasoning capability of vision-language mod- els (VLMs) via reinforcement learning

[Panet al., 2025 ] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. MedVLM-R1: Incentiviz- ing medical reasoning capability of vision-language mod- els (VLMs) via reinforcement learning. InMedical Image Computing and Computer-Assisted Intervention – MIC- CAI

2025

[22] [22]

A survey of multilingual large language models.Patterns, 6(1),

[Qinet al., 2025 ] Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanxiang Che, and Philip S Yu. A survey of multilingual large language models.Patterns, 6(1),

2025

[23] [23]

Towards building multilingual language model for medicine.Nat

[Qiuet al., 2024 ] Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yan- feng Wang, and Weidi Xie. Towards building multilingual language model for medicine.Nat. Commun., 15(1):8384, September

2024

[24] [24]

A multilingual multimodal medical examina- tion dataset for visual question answering in healthcare

[Riccioet al., 2025 ] Giuseppe Riccio, Antonio Romano, Mariano Barone, Gian Marco Orlando, Diego Russo, Marco Postiglione, Valerio La Gatta, and Vincenzo Moscato. A multilingual multimodal medical examina- tion dataset for visual question answering in healthcare. In 2025 IEEE 38th International Symposium on Computer- Based Medical Systems (CBMS), pages 435–440,

2025

[25] [25]

AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,

[Schmidgallet al., 2024 ] Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,

2024

[26] [26]

MedGemma technical report,

[Sellergrenet al., 2025 ] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C´ıan Hughes, Charles Lau, et al. MedGemma technical report,

2025

[27] [27]

Chiron-o1: Igniting multimodal large language models towards generalizable medical rea- soning via mentor-intern collaborative search

[Sunet al., 2025 ] Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, and Xiaosong Wang. Chiron-o1: Igniting multimodal large language models towards generalizable medical rea- soning via mentor-intern collaborative search. InAdvances in Neural Information Processing Systems,

2025

[28] [28]

Vi- raasat: Traversing novel paths for indian cultural reason- ing,

[Suranaet al., 2026 ] Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, and Amit Sheth. Vi- raasat: Traversing novel paths for indian cultural reason- ing,

2026

[29] [29]

Medagents: Large language models as collaborators for zero-shot medical reasoning

[Tanget al., 2024 ] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InFindings of the Association for Computational Linguis- tics: ACL 2024, pages 599–621,

2024

[30] [30]

Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,

[Wanget al., 2024 ] Peng Wang, Shuai Bai, Sinan Tan, Shi- jie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,

2024

[31] [31]

MMedAgent-RL: Opti- mizing multi-agent collaboration for multimodal medical reasoning

[Xiaet al., 2026 ] Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Shu- jie Liu, Yan Lu, and Huaxiu Yao. MMedAgent-RL: Opti- mizing multi-agent collaboration for multimodal medical reasoning. InInternational Conference on Learning Rep- resentations,

2026

[32] [32]

Mul- timodal large language models for medical image diag- nosis: Challenges and opportunities.J

[Zhanget al., 2025 ] Andrew Zhang, Eric Zhao, Ruirui Wang, Xiuqi Zhang, Justin Wang, and Ethan Chen. Mul- timodal large language models for medical image diag- nosis: Challenges and opportunities.J. Biomed. Inform., 169(104895):104895, September

2025

[33] [33]

Reinforced visual per- ception with tools, 2025

[Zhouet al., 2025b ] Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual per- ception with tools, 2025

2025