How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Intesar Tahmid; Md Fahim; Md Farhad Alam Bhuiyan; Mir Sazzat Hossain; Rafid Ahmed; Tasnimul Hossain Tomal

arxiv: 2605.18111 · v1 · pith:LTX6WV2Pnew · submitted 2026-05-18 · 💻 cs.CL · cs.CV

Pith reviewed 2026-05-20 10:55 UTC · model grok-4.3

keywords: models, medical, bangla, medvqa, questions, answering, clinically, complex

The pith

Foundation models perform substantially lower on Bangla medical visual questions than on English benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BanglaMedVQA, the first dataset of clinically validated medical images paired with questions and answers in Bangla. Evaluation of current large vision-language models on this dataset shows markedly worse results than on established English medical benchmarks. Even leading systems such as Gemini and GPT-4.1 mini fail on questions requiring precise diagnostic reasoning. A sympathetic reader would care because Bangla is spoken by over 250 million people and weak performance limits the potential for AI-assisted medical support in those regions.

Core claim

The authors create BanglaMedVQA with clinically validated image-question-answer pairs and demonstrate through evaluation that current foundation models exhibit substantially lower performance on Bangla medical visual questions compared to English benchmarks. Even top models fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning for low-resource languages.

What carries the argument

BanglaMedVQA dataset of clinically validated image-question-answer pairs, used to benchmark foundation models and reveal language-specific performance gaps in medical visual question answering.

If this is right

Performance remains especially poor on specialized diagnostic questions across all tested models.
Certain open-source models occasionally match closed models on general categories but still fail on complex clinical questions.
The results underscore the urgent need for improved evaluation methods suited to low-resource medical domains.
Bangla performance gaps reflect broader challenges inherent to low-resource languages in medical reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks for other low-resource languages could expose parallel gaps in medical AI capabilities.
Real-world deployment tests in Bangla-speaking clinics would reveal whether the benchmark gaps translate to practical diagnostic errors.
Targeted fine-tuning on Bangla medical data offers a direct way to test whether the observed limitations can be reduced.
The dataset could serve as a starting point for comparing multilingual medical VQA progress across additional languages.

Load-bearing premise

The image-question-answer pairs accurately represent real clinical scenarios in Bangla-speaking regions and the evaluation protocol isolates language limitations rather than dataset artifacts or prompting choices.

What would settle it

A model achieving accuracy on BanglaMedVQA comparable to its English MedVQA scores after targeted Bangla medical fine-tuning would challenge the claim of inherent low-resource limitations.

Figures

Figures reproduced from arXiv: 2605.18111 by Intesar Tahmid, Md Fahim, Md Farhad Alam Bhuiyan, Mir Sazzat Hossain, Rafid Ahmed, Tasnimul Hossain Tomal.

**Figure 1.** Workflow of the dataset curation process. Images and metadata were obtained from two widely used biomedical …

**Figure 2.** Distributions of clinical conditions and question keywords in the curated dataset.

**Figure 3.** LAVE score comparison of the Chest X-Ray dataset for different models under four different settings: vanilla (baseline …

**Figure 4.** LAVE score comparison of the MedICat dataset for different models under four different settings: vanilla (baseline …

**Figure 5.** LAVE score comparison on the Chest X-Ray dataset across categorical question types with chain-of-thought reasoning …

**Figure 6.** LAVE score comparison on the MedICat dataset across categorical question types with chain-of-thought reasoning in …

**Figure 7.** Error Analysis of Medical Visual Question Answering (MedVQA) pairs from the proposed dataset, showcasing …

Original abstract

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces BanglaMedVQA, a new dataset of clinically validated Bangla-language medical image-question-answer pairs, and reports a comprehensive benchmark of closed-source and open-source LVLMs on this resource. The central claim is that model performance on BanglaMedVQA is substantially lower than on existing English MedVQA benchmarks, with even the strongest models (Gemini, GPT-4.1-mini) failing on specialized diagnostic questions.

Significance. If the dataset construction and evaluation controls are sound, the work supplies the first public benchmark for Bangla MedVQA and quantifies the additional difficulty current foundation models face in low-resource-language medical visual reasoning. It could usefully motivate targeted data collection or fine-tuning efforts for Bangla medical applications.

major comments (3)

[Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.
[Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.
[Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.

minor comments (1)

[Abstract] Abstract: the model identifier 'GPT-4.1 mini' is non-standard; clarify whether this refers to GPT-4o-mini or another variant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our work.

Referee: [Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.

Authors: We agree that the original manuscript provided insufficient detail on the clinical validation process. The dataset was constructed with input from three licensed physicians (two with specialization in diagnostic radiology and one in internal medicine), using a protocol of independent review by each expert followed by a consensus discussion for disagreements. Inter-annotator agreement reached a Cohen's kappa of 0.81 on a held-out sample of 150 pairs. We will add a dedicated paragraph in the revised Dataset section describing the experts' qualifications, the full validation protocol, and the agreement statistics to better substantiate the clinical validation claim.

revision: yes

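
For readers unfamiliar with the agreement statistic named in this response, the following is a minimal sketch of Cohen's kappa for two raters. It is an illustration only, not the authors' validation code. The function name `cohens_kappa` and the accept/reject labels are hypothetical. Cohen's kappa is defined for exactly two raters, and the response does not say how ratings from three experts were combined, so the sketch covers only the pairwise case.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical accept/reject decisions from two reviewers on ten pairs.
reviewer_1 = ["accept"] * 7 + ["reject"] * 3
reviewer_2 = ["accept"] * 6 + ["reject"] * 4
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the reviewers agree on nine of ten items, but kappa is lower than 0.9 because some of that agreement is expected by chance.
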
Referee: [Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.

Authors: We acknowledge that these protocol details were not fully specified. In the revision, we will insert a new Evaluation Protocol subsection clarifying that all models were evaluated zero-shot using standardized English prompts (to enable fair cross-lingual comparison), without chain-of-thought instructions in the main results. Answer extraction uses rule-based parsing to isolate the final answer token or phrase after stripping explanatory text, with exact-match accuracy for closed-ended questions and a combination of BLEU-4 and ROUGE-L for open-ended responses. These additions will allow readers to assess whether the reported gaps are attributable to language rather than implementation choices.

revision: yes

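
The response describes rule-based answer extraction, exact-match scoring, and ROUGE-L, but gives no rules. The sketch below shows one way such a pipeline could look, under stated assumptions: the `Answer:` / `উত্তর:` markers, the punctuation-stripping normalization, and all function names are hypothetical and are not taken from the paper. ROUGE-L is implemented here as the standard longest-common-subsequence F-measure over whitespace tokens. BLEU-4 is omitted.

```python
import re
import unicodedata

# Hypothetical answer markers; the paper does not specify its parsing rules.
_MARKER = re.compile(r"(?:answer|উত্তর)\s*:", re.IGNORECASE)

def extract_final_answer(response):
    """Return the last non-empty line after the last answer marker, if any."""
    tail = _MARKER.split(response)[-1]
    lines = [line.strip() for line in tail.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def normalize(text):
    """NFC-normalize, casefold, replace punctuation with spaces, collapse spaces.

    Punctuation is detected by Unicode category rather than \\w, because \\w
    does not match Bangla combining vowel signs and would strip them.
    """
    text = unicodedata.normalize("NFC", text).casefold()
    kept = (" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return " ".join("".join(kept).split())

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure based on longest common subsequence of tokens."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return 0.0
    prev = [0] * (len(ref) + 1)  # one row of the LCS dynamic-programming table
    for p in pred:
        curr = [0]
        for j, r in enumerate(ref, start=1):
            curr.append(prev[j - 1] + 1 if p == r else max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model output for a yes/no question with a Bangla gold answer.
gold = "হ্যাঁ"
response = f"The cardiac silhouette looks enlarged.\nAnswer: {gold}।"
print(exact_match(extract_final_answer(response), gold))
print(rouge_l_f1("the left lung is clear", "left lung clear"))
```

The normalization step matters for Bangla in particular: a trailing danda (।) or inconsistent Unicode composition would otherwise turn a correct answer into an exact-match failure, which is the kind of parsing artifact the referee comment is concerned with.
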
Referee: [Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.

Authors: We thank the referee for this point. The primary comparisons were to VQA-RAD and SLAKE as representative English MedVQA benchmarks. We recognize that explicit distributional matching was not provided. In the revised Results section, we will include a new table summarizing question-type distributions (diagnostic, descriptive, etc.), image modalities, and answer-length statistics for BanglaMedVQA versus the English references. While perfect matching across all dimensions is not feasible given the distinct clinical contexts, this will better support our interpretation of the performance differences as reflecting low-resource language challenges while transparently noting remaining dataset differences.

revision: partial
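
One way to quantify how closely two benchmarks match in question-type mix is sketched below. It is an illustrative assumption, not something the authors propose: the total variation distance, the function names, and the label counts are all hypothetical. A distance of 0 means identical distributions and 1 means no overlap.

```python
from collections import Counter

def distribution(labels):
    """Relative frequency of each question type."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical question-type labels for a Bangla and an English benchmark.
bangla_types = ["diagnostic"] * 5 + ["descriptive"] * 3 + ["position"] * 2
english_types = ["diagnostic"] * 2 + ["descriptive"] * 6 + ["position"] * 2

tv = total_variation(distribution(bangla_types), distribution(english_types))
print(f"total variation distance = {tv:.2f}")
```

A large distance would indicate that part of any accuracy gap could come from a different question mix rather than from language alone.
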

Circularity Check

0 steps flagged

No significant circularity in empirical dataset benchmarking

The paper introduces BanglaMedVQA as a new clinically validated dataset and reports direct model evaluation results on it. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on observable model outputs against external English MedVQA benchmarks rather than self-referential definitions or self-citation chains. The work is self-contained and externally falsifiable via the released dataset and standard prompting protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on the assumption that clinically validated pairs can be created for Bangla medical images and that standard VQA evaluation metrics transfer meaningfully to this low-resource setting; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Clinically validated image-question-answer pairs can be reliably constructed for Bangla medical contexts. Invoked in the dataset introduction and evaluation claims.

Domain assumption: Performance differences between Bangla and English MedVQA reflect inherent language-resource challenges rather than dataset construction artifacts. Central to the comparison and conclusion about low-resource languages.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "We introduce BanglaMedVQA, a dataset comprising clinically validated image–question–answer pairs... overall accuracies of 40.38% and 26.50%... performance on specialized diagnostic categories such as Condition/Finding and Position falls below random chance"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "The dataset contains 1,374 unique image–caption pairs... validated by two certified physicians... 97% acceptance rate"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[23] [23]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i9.33047 , number=

work page doi:10.1609/aaai.v39i9.33047 2025

[24] [24]

2025 , eprint=

Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement , author=. 2025 , eprint=

work page 2025

[25] [25]

Farhad Alam , title=

Deeparghya Dutta Barua and Md Sakib Ul Rahman Sourove and Md Farhan Ishmam and Fabiha Haider and Fariha Tanjim Shifat and Md Fahim and Md. Farhad Alam , title=. CoRR , volume=. 2024 , cdate=

work page 2024

[26] [26]

Rafi, Mahamudul Hasan and Islam, Shifat and Hasan Imtiaz Labib, S. M. and Hasan, SM Sajid and Shah, Faisal Muhammad and Ahmed, Sifat , booktitle=. A Deep Learning-Based Bengali Visual Question Answering System , year=

work page

[27] [27]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[28] [28]

MedGemma Technical Report

Medgemma technical report , author=. arXiv preprint arXiv:2507.05201 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [31]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page

[31] [32]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [33]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [34]

Qwen2.5-Omni Technical Report

Qwen2. 5-omni technical report , author=. arXiv preprint arXiv:2503.20215 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [35]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=

work page

[35] [36]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023

[36] [37]

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Improved Baselines with Visual Instruction Tuning , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2024

[37] [38]

ArXiv , year=

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning , author=. ArXiv , year=

work page

[38] [39]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [40]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [41]

Information and Software Technology , volume=

A survey on dataset quality in machine learning , author=. Information and Software Technology , volume=. 2023 , publisher=

work page 2023

[41] [42]

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , year=

Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh , booktitle=. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , year=

work page

[42] [43]

European conference on computer vision , pages=

A-okvqa: A benchmark for visual question answering using world knowledge , author=. European conference on computer vision , pages=. 2022 , organization=

work page 2022

[43] [44]

arXiv preprint arXiv:2405.20421 , year=

Worse than random? an embarrassingly simple probing evaluation of large multimodal models in medical vqa , author=. arXiv preprint arXiv:2405.20421 , year=

work page arXiv

[44] [45]

2025 , howpublished =

Med VQA BN Overall , author =. 2025 , howpublished =

work page 2025

[45] [46]

IEEE Reviews in Biomedical Engineering , volume=

Automated radiology report generation: A review of recent advances , author=. IEEE Reviews in Biomedical Engineering , volume=. 2024 , publisher=

work page 2024

[46] [47]

2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=

Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering , author=. 2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=. 2021 , organization=

work page 2021

[47] [48]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Pathvqa: 30000+ questions for medical visual question answering , author=. arXiv preprint arXiv:2003.10286 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2003

[48] [49]

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Pmc-vqa: Visual instruction tuning for medical visual question answering , author=. arXiv preprint arXiv:2305.10415 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[49] [50]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Improving automatic vqa evaluation using large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[50] [51]

arXiv preprint arXiv:2406.06331 , year=

MedExQA: Medical question answering benchmark with multiple explanations , author=. arXiv preprint arXiv:2406.06331 , year=

work page arXiv

[51] [52]

arXiv preprint arXiv:2404.15149 , year=

Bias patterns in the application of LLMs for clinical decision support: A comprehensive study , author=. arXiv preprint arXiv:2404.15149 , year=

work page arXiv

[52] [53]

arXiv preprint arXiv:2401.13081 , year=

Free form medical visual question answering in radiology , author=. arXiv preprint arXiv:2401.13081 , year=

work page arXiv

[53] [54]

A generalisation of Fleiss' kappa , author=

Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa , author=. arXiv preprint arXiv:2303.12502 , year=

work page arXiv

[54] [55]

2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) , pages=

Self-supervised vision-language pretraining for medial visual question answering , author=. 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) , pages=. 2023 , organization=

work page 2023