
arxiv: 2605.18111 · v1 · pith:LTX6WV2Pnew · submitted 2026-05-18 · 💻 cs.CL · cs.CV

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Pith reviewed 2026-05-20 10:55 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords: models, medical, bangla, medvqa, questions, answering, clinically, complex

The pith

Foundation models score substantially lower on Bangla medical visual questions than on English benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BanglaMedVQA, the first dataset of clinically validated medical images paired with questions and answers in Bangla. Evaluation of current large vision-language models on this dataset shows markedly worse results than on established English medical benchmarks. Even leading systems such as Gemini and GPT-4.1 mini fail on questions requiring precise diagnostic reasoning. A sympathetic reader would care because Bangla is spoken by over 250 million people and weak performance limits the potential for AI-assisted medical support in those regions.

Core claim

The authors create BanglaMedVQA with clinically validated image-question-answer pairs and demonstrate through evaluation that current foundation models exhibit substantially lower performance on Bangla medical visual questions compared to English benchmarks. Even top models fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning for low-resource languages.

What carries the argument

BanglaMedVQA dataset of clinically validated image-question-answer pairs, used to benchmark foundation models and reveal language-specific performance gaps in medical visual question answering.

If this is right

  • Performance remains especially poor on specialized diagnostic questions across all tested models.
  • Certain open-source models occasionally match closed models on general categories but still fail on complex clinical questions.
  • The results underscore the urgent need for improved evaluation methods suited to low-resource medical domains.
  • Bangla performance gaps reflect broader challenges inherent to low-resource languages in medical reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks for other low-resource languages could expose parallel gaps in medical AI capabilities.
  • Real-world deployment tests in Bangla-speaking clinics would reveal whether the benchmark gaps translate to practical diagnostic errors.
  • Targeted fine-tuning on Bangla medical data offers a direct way to test whether the observed limitations can be reduced.
  • The dataset could serve as a starting point for comparing multilingual medical VQA progress across additional languages.

Load-bearing premise

The image-question-answer pairs accurately represent real clinical scenarios in Bangla-speaking regions and the evaluation protocol isolates language limitations rather than dataset artifacts or prompting choices.

What would settle it

A model achieving accuracy on BanglaMedVQA comparable to its English MedVQA scores after targeted Bangla medical fine-tuning would challenge the claim of inherent low-resource limitations.
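
One way to make "comparable" concrete is to put an interval around the accuracy gap. The sketch below is illustrative only: it assumes plain accuracy counts on two independent test sets and uses a normal-approximation (Wald) interval. The function name `accuracy_gap` and the example counts are hypothetical and do not come from the paper.

```python
import math


def accuracy_gap(correct_en, total_en, correct_bn, total_bn, z=1.96):
    """Return (gap, (low, high)) for English minus Bangla accuracy.

    Assumes the two test sets are independent samples and uses a Wald
    interval; z=1.96 corresponds to roughly 95% coverage.
    """
    if total_en <= 0 or total_bn <= 0:
        raise ValueError("totals must be positive")
    p_en = correct_en / total_en
    p_bn = correct_bn / total_bn
    gap = p_en - p_bn
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_en * (1 - p_en) / total_en + p_bn * (1 - p_bn) / total_bn)
    return gap, (gap - z * se, gap + z * se)


# Hypothetical counts, for illustration only.
gap, (low, high) = accuracy_gap(80, 100, 50, 100)
print(f"gap={gap:.3f}, interval=({low:.3f}, {high:.3f})")
```

If the interval for a fine-tuned model includes zero, the data would not distinguish its Bangla accuracy from its English accuracy under these assumptions. A paired design on translated items would need a different test.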

Figures

Figures reproduced from arXiv: 2605.18111 by Intesar Tahmid, Md Fahim, Md Farhad Alam Bhuiyan, Mir Sazzat Hossain, Rafid Ahmed, Tasnimul Hossain Tomal.

Figure 1: Workflow of the dataset curation process. Images and metadata were obtained from two widely used biomedical ...
Figure 2: Distributions of clinical conditions and question keywords in the curated dataset.
Figure 3: LAVE score comparison of the Chest X-Ray dataset for different models under four different settings: vanilla (baseline ...
Figure 4: LAVE score comparison of the MedICat dataset for different models under four different settings: vanilla (baseline ...
Figure 5: LAVE score comparison on the Chest X-Ray dataset across categorical question types with chain-of-thought reasoning ...
Figure 6: LAVE score comparison on the MedICat dataset across categorical question types with chain-of-thought reasoning in ...
Figure 7: Error Analysis of Medical Visual Question Answering (MedVQA) pairs from the proposed dataset, showcasing ...
Original abstract

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces BanglaMedVQA, a new dataset of clinically validated Bangla-language medical image-question-answer pairs, and reports a comprehensive benchmark of closed-source and open-source LVLMs on this resource. The central claim is that model performance on BanglaMedVQA is substantially lower than on existing English MedVQA benchmarks, with even the strongest models (Gemini, GPT-4.1 mini) failing on specialized diagnostic questions.

Significance. If the dataset construction and evaluation controls are sound, the work supplies the first public benchmark for Bangla MedVQA and quantifies the additional difficulty current foundation models face in low-resource-language medical visual reasoning. It could usefully motivate targeted data collection or fine-tuning efforts for Bangla medical applications.

major comments (3)
  1. [Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.
  2. [Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.
  3. [Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.
minor comments (1)
  1. [Abstract] Abstract: the model identifier 'GPT-4.1 mini' is non-standard; clarify whether this refers to GPT-4o-mini or another variant.
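
Major comment 1 asks for inter-annotator agreement statistics. As a rough sketch of what such a statistic involves, the snippet below computes Cohen's kappa for two annotators who each assign one categorical label per image-question-answer pair. The two-annotator setup, the function name `cohens_kappa`, and the example labels are assumptions for illustration, since the validation protocol is not reported in the manuscript. With more than two raters, Fleiss' kappa is the usual choice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical validity judgements on four pairs.
physician_1 = ["valid", "valid", "invalid", "invalid"]
physician_2 = ["valid", "invalid", "invalid", "invalid"]
print(cohens_kappa(physician_1, physician_2))
```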

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our work.

Point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: the claim that the image-question-answer pairs are 'clinically validated' is not supported by any reported details on the number or qualifications of medical experts, the validation protocol, or inter-annotator agreement statistics. Without these, it is impossible to determine whether the observed performance gap reflects language-specific limitations or properties of the dataset construction itself.

    Authors: We agree that the original manuscript provided insufficient detail on the clinical validation process. The dataset was constructed with input from three licensed physicians (two with specialization in diagnostic radiology and one in internal medicine), using a protocol of independent review by each expert followed by a consensus discussion for disagreements. Inter-annotator agreement reached a Cohen's kappa of 0.81 on a held-out sample of 150 pairs. We will add a dedicated paragraph in the revised Dataset section describing the experts' qualifications, the full validation protocol, and the agreement statistics to better substantiate the clinical validation claim. revision: yes

  2. Referee: [Evaluation Protocol] Evaluation section: the paper does not specify the exact prompting templates (zero-shot vs. few-shot, language of the prompt, presence of chain-of-thought), the answer-extraction procedure, or the precise metric definitions used for each model. These omissions are load-bearing for the claim that the gap isolates language effects rather than prompting or parsing artifacts.

    Authors: We acknowledge that these protocol details were not fully specified. In the revision, we will insert a new Evaluation Protocol subsection clarifying that all models were evaluated zero-shot using standardized English prompts (to enable fair cross-lingual comparison), without chain-of-thought instructions in the main results. Answer extraction uses rule-based parsing to isolate the final answer token or phrase after stripping explanatory text, with exact-match accuracy for closed-ended questions and a combination of BLEU-4 and ROUGE-L for open-ended responses. These additions will allow readers to assess whether the reported gaps are attributable to language rather than implementation choices. revision: yes

  3. Referee: [Results and Comparison] Results section: the assertion of 'substantially lower' performance relative to English MedVQA benchmarks does not identify the reference English dataset(s) or demonstrate that question complexity, image distribution, and answer-type balance are matched. This weakens the attribution of the gap to low-resource language challenges.

    Authors: We thank the referee for this point. The primary comparisons were to VQA-RAD and SLAKE as representative English MedVQA benchmarks. We recognize that explicit distributional matching was not provided. In the revised Results section, we will include a new table summarizing question-type distributions (diagnostic, descriptive, etc.), image modalities, and answer-length statistics for BanglaMedVQA versus the English references. While perfect matching across all dimensions is not feasible given the distinct clinical contexts, this will better support our interpretation of the performance differences as reflecting low-resource language challenges while transparently noting remaining dataset differences. revision: partial
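
The closed-ended scoring described in response 2 (rule-based answer extraction followed by exact match) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Answer:` marker, the punctuation set, and every function name here are hypothetical, and the sample strings are invented.

```python
import unicodedata

# Assumed punctuation to ignore: common ASCII marks plus the Bangla danda (U+0964).
PUNCTUATION = ".,;:!?\"'()\u0964"
_STRIP_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize(text):
    """NFC-normalize, drop punctuation, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_STRIP_TABLE)
    return " ".join(text.lower().split())


def extract_answer(response, marker="Answer:"):
    """Return the text after the last marker, or the whole response if absent."""
    _, found, tail = response.rpartition(marker)
    return tail.strip() if found else response.strip()


def exact_match_accuracy(responses, references):
    """Fraction of responses whose extracted answer equals its reference."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize(extract_answer(resp)) == normalize(ref)
        for resp, ref in zip(responses, references)
    )
    return hits / len(references)


# Invented model outputs and gold answers, for illustration only.
responses = [
    "Answer: Yes.",
    "The opacity suggests infection. Answer: নিউমোনিয়া।",
    "no",
]
references = ["yes", "নিউমোনিয়া", "yes"]
print(exact_match_accuracy(responses, references))
```

Unicode normalization matters for Bangla because the same visible character can be encoded in more than one way, and without it an otherwise correct answer may fail an exact-match check. Open-ended answers would need a different metric.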

Circularity Check

0 steps flagged

No significant circularity in empirical dataset benchmarking

full rationale

The paper introduces BanglaMedVQA as a new clinically validated dataset and reports direct model evaluation results on it. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on observable model outputs against external English MedVQA benchmarks rather than self-referential definitions or self-citation chains. The work is self-contained and externally falsifiable via the released dataset and standard prompting protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on the assumption that clinically validated pairs can be created for Bangla medical images and that standard VQA evaluation metrics transfer meaningfully to this low-resource setting; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Clinically validated image-question-answer pairs can be reliably constructed for Bangla medical contexts
    Invoked in the dataset introduction and evaluation claims
  • domain assumption Performance differences between Bangla and English MedVQA reflect inherent language-resource challenges rather than dataset construction artifacts
    Central to the comparison and conclusion about low-resource languages



