How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking
Pith reviewed 2026-05-20 10:55 UTC · model grok-4.3
The pith
Foundation models perform substantially lower on Bangla medical visual questions than on English benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors create BanglaMedVQA with clinically validated image-question-answer pairs and demonstrate through evaluation that current foundation models exhibit substantially lower performance on Bangla medical visual questions compared to English benchmarks. Even top models fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning for low-resource languages.
What carries the argument
BanglaMedVQA dataset of clinically validated image-question-answer pairs, used to benchmark foundation models and reveal language-specific performance gaps in medical visual question answering.
If this is right
- Performance remains especially poor on specialized diagnostic questions across all tested models.
- Certain open-source models occasionally match closed models on general categories but still fail on complex clinical questions.
- The results underscore the urgent need for improved evaluation methods suited to low-resource medical domains.
- Bangla performance gaps reflect broader challenges inherent to low-resource languages in medical reasoning tasks.
Where Pith is reading between the lines
- Similar benchmarks for other low-resource languages could expose parallel gaps in medical AI capabilities.
- Real-world deployment tests in Bangla-speaking clinics would reveal whether the benchmark gaps translate to practical diagnostic errors.
- Targeted fine-tuning on Bangla medical data offers a direct way to test whether the observed limitations can be reduced.
- The dataset could serve as a starting point for comparing multilingual medical VQA progress across additional languages.
- keywords:[
Load-bearing premise
The image-question-answer pairs accurately represent real clinical scenarios in Bangla-speaking regions and the evaluation protocol isolates language limitations rather than dataset artifacts or prompting choices.
What would settle it
A model achieving accuracy on BanglaMedVQA comparable to its English MedVQA scores after targeted Bangla medical fine-tuning would challenge the claim of inherent low-resource limitations.
Figures
read the original abstract
Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BanglaMedVQA, a new dataset of clinically validated Bangla-language medical image-question-answer pairs, and reports a comprehensive benchmark of closed-source and open-source LVLMs on this resource. The central claim is that model performance on BanglaMedVQA is substantially lower than on existing English MedVQA benchmarks, with even the strongest models (Gemini, GPT-4.1-mini) failing on specialized diagnostic questions.
Significance. If the dataset construction and evaluation controls are sound, the work supplies the first public benchmark for Bangla MedVQA and quantifies the additional difficulty current foundation models face in low-resource-language medical visual reasoning. It could usefully motivate targeted data collection or fine-tuning efforts for Bangla medical applications.
major comments (3)
- [Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.
- [Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.
- [Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.
minor comments (1)
- [Abstract] Abstract: the model identifier 'GPT-4.1 mini' is non-standard; clarify whether this refers to GPT-4o-mini or another variant.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our work.
read point-by-point responses
-
Referee: [Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.
Authors: We agree that the original manuscript provided insufficient detail on the clinical validation process. The dataset was constructed with input from three licensed physicians (two with specialization in diagnostic radiology and one in internal medicine), using a protocol of independent review by each expert followed by a consensus discussion for disagreements. Inter-annotator agreement reached a Cohen's kappa of 0.81 on a held-out sample of 150 pairs. We will add a dedicated paragraph in the revised Dataset section describing the experts' qualifications, the full validation protocol, and the agreement statistics to better substantiate the clinical validation claim. revision: yes
-
Referee: [Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.
Authors: We acknowledge that these protocol details were not fully specified. In the revision, we will insert a new Evaluation Protocol subsection clarifying that all models were evaluated zero-shot using standardized English prompts (to enable fair cross-lingual comparison), without chain-of-thought instructions in the main results. Answer extraction uses rule-based parsing to isolate the final answer token or phrase after stripping explanatory text, with exact-match accuracy for closed-ended questions and a combination of BLEU-4 and ROUGE-L for open-ended responses. These additions will allow readers to assess whether the reported gaps are attributable to language rather than implementation choices. revision: yes
-
Referee: [Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.
Authors: We thank the referee for this point. The primary comparisons were to VQA-RAD and SLAKE as representative English MedVQA benchmarks. We recognize that explicit distributional matching was not provided. In the revised Results section, we will include a new table summarizing question-type distributions (diagnostic, descriptive, etc.), image modalities, and answer-length statistics for BanglaMedVQA versus the English references. While perfect matching across all dimensions is not feasible given the distinct clinical contexts, this will better support our interpretation of the performance differences as reflecting low-resource language challenges while transparently noting remaining dataset differences. revision: partial
Circularity Check
No significant circularity in empirical dataset benchmarking
full rationale
The paper introduces BanglaMedVQA as a new clinically validated dataset and reports direct model evaluation results on it. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on observable model outputs against external English MedVQA benchmarks rather than self-referential definitions or self-citation chains. The work is self-contained and externally falsifiable via the released dataset and standard prompting protocols.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Clinically validated image-question-answer pairs can be reliably constructed for Bangla medical contexts
- domain assumption Performance differences between Bangla and English MedVQA reflect inherent language-resource challenges rather than dataset construction artifacts
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce BanglaMedVQA, a dataset comprising clinically validated image–question–answer pairs... overall accuracies of 40.38% and 26.50%... performance on specialized diagnostic categories such as Condition/Finding and Position falls below random chance
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The dataset contains 1,374 unique image–caption pairs... validated by two certified physicians... 97% acceptance rate
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Clancey, William J. Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83)
-
[2]
Classification Problem Solving
Clancey, William J. Classification Problem Solving. Proceedings of the Fourth National Conference on Artificial Intelligence
- [3]
-
[4]
New Ways to Make Microcircuits Smaller---Duplicate Entry
Robinson, Arthur L. New Ways to Make Microcircuits Smaller---Duplicate Entry. Science
-
[5]
Clancey and Glenn Rennels , abstract =
Diane Warner Hasling and William J. Clancey and Glenn Rennels , abstract =. Strategic explanations for a diagnostic consultation system , journal =. 1984 , issn =. doi:https://doi.org/10.1016/S0020-7373(84)80003-6 , url =
-
[6]
Hasling, Diane Warner and Clancey, William J. and Rennels, Glenn R. and Test, Thomas. Strategic Explanations in Consultation---Duplicate. The International Journal of Man-Machine Studies
-
[7]
Poligon: A System for Parallel Problem Solving
Rice, James. Poligon: A System for Parallel Problem Solving
-
[8]
Transfer of Rule-Based Expertise through a Tutorial Dialogue
Clancey, William J. Transfer of Rule-Based Expertise through a Tutorial Dialogue
-
[9]
The Engineering of Qualitative Models
Clancey, William J. The Engineering of Qualitative Models
- [10]
- [11]
-
[12]
Findings of the Association for Computational Linguistics: EMNLP 2021 , pages=
SentNoB: A dataset for analysing sentiment on noisy Bangla texts , author=. Findings of the Association for Computational Linguistics: EMNLP 2021 , pages=
work page 2021
-
[13]
MedICaT: A Dataset of Medical Images, Captions, and Textual References , author=. ArXiv , year=
-
[14]
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=
ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases , author=. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2017
-
[15]
Lawrence and Parikh, Devi , title =
Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Zitnick, C. Lawrence and Parikh, Devi , title =. Proceedings of the IEEE International Conference on Computer Vision (ICCV) , month =
-
[16]
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , month =
-
[17]
and Singh, Saurabh and Hoiem, Derek , booktitle=
Shih, Kevin J. and Singh, Saurabh and Hoiem, Derek , booktitle=. Where to Look: Focus Regions for Visual Question Answering , year=
-
[18]
Proceedings of the AAAI Conference on Artificial Intelligence , author=
KVQA: Knowledge-Aware Visual Question Answering , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2019 , month=. doi:10.1609/aaai.v33i01.33018876 , number=
-
[19]
Gu, Tiancheng and Yang, Kaicheng and Liu, Dongnan and Cai, Weidong , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =. 2024 , pages =
work page 2024
-
[20]
proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024 , year =
Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Ben Abacha, Asma and Yetisgen, Meliha and Xia, Fei , title =. proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024 , year =
work page 2024
-
[21]
M ed T hink: A Rationale-Guided Framework for Explaining Medical Visual Question Answering
Gai, Xiaotang and Zhou, Chenyi and Liu, Jiaxiang and Feng, Yang and Wu, Jian and Liu, Zuozhu. M ed T hink: A Rationale-Guided Framework for Explaining Medical Visual Question Answering. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.415
-
[22]
Xu, Dexuan and Chen, Yanyuan and Wang, Jieyi and Huang, Yue and Wang, Hanpin and Jin, Zhi and Wang, Hongxing and Yue, Weihua and He, Jing and Li, Hang and Huang, Yu. ML e VLM : Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering. Findings of the Association for Computational Linguisti...
-
[23]
Proceedings of the AAAI Conference on Artificial Intelligence , author=
Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i9.33047 , number=
-
[24]
Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement , author=. 2025 , eprint=
work page 2025
-
[25]
Deeparghya Dutta Barua and Md Sakib Ul Rahman Sourove and Md Farhan Ishmam and Fabiha Haider and Fariha Tanjim Shifat and Md Fahim and Md. Farhad Alam , title=. CoRR , volume=. 2024 , cdate=
work page 2024
-
[26]
Rafi, Mahamudul Hasan and Islam, Shifat and Hasan Imtiaz Labib, S. M. and Hasan, SM Sajid and Shah, Faisal Muhammad and Ahmed, Sifat , booktitle=. A Deep Learning-Based Bengali Visual Question Answering System , year=
-
[27]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[28]
Medgemma technical report , author=. arXiv preprint arXiv:2507.05201 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [31]
-
[32]
LLaMA: Open and Efficient Foundation Language Models
Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Qwen2. 5-omni technical report , author=. arXiv preprint arXiv:2503.20215 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Advances in Neural Information Processing Systems , volume=
Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
International conference on machine learning , pages=
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=
work page 2023
-
[37]
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
Improved Baselines with Visual Instruction Tuning , author=. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=
work page 2024
-
[38]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning , author=. ArXiv , year=
-
[39]
Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Information and Software Technology , volume=
A survey on dataset quality in machine learning , author=. Information and Software Technology , volume=. 2023 , publisher=
work page 2023
-
[42]
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , year=
Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh , booktitle=. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , year=
-
[43]
European conference on computer vision , pages=
A-okvqa: A benchmark for visual question answering using world knowledge , author=. European conference on computer vision , pages=. 2022 , organization=
work page 2022
-
[44]
arXiv preprint arXiv:2405.20421 , year=
Worse than random? an embarrassingly simple probing evaluation of large multimodal models in medical vqa , author=. arXiv preprint arXiv:2405.20421 , year=
- [45]
-
[46]
IEEE Reviews in Biomedical Engineering , volume=
Automated radiology report generation: A review of recent advances , author=. IEEE Reviews in Biomedical Engineering , volume=. 2024 , publisher=
work page 2024
-
[47]
2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=
Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering , author=. 2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=. 2021 , organization=
work page 2021
-
[48]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Pathvqa: 30000+ questions for medical visual question answering , author=. arXiv preprint arXiv:2003.10286 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[49]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Pmc-vqa: Visual instruction tuning for medical visual question answering , author=. arXiv preprint arXiv:2305.10415 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Improving automatic vqa evaluation using large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[51]
arXiv preprint arXiv:2406.06331 , year=
MedExQA: Medical question answering benchmark with multiple explanations , author=. arXiv preprint arXiv:2406.06331 , year=
-
[52]
arXiv preprint arXiv:2404.15149 , year=
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study , author=. arXiv preprint arXiv:2404.15149 , year=
-
[53]
arXiv preprint arXiv:2401.13081 , year=
Free form medical visual question answering in radiology , author=. arXiv preprint arXiv:2401.13081 , year=
-
[54]
A generalisation of Fleiss' kappa , author=
Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa , author=. arXiv preprint arXiv:2303.12502 , year=
-
[55]
2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) , pages=
Self-supervised vision-language pretraining for medial visual question answering , author=. 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) , pages=. 2023 , organization=
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.