Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

Fahmi Khalifa; Frederick Akor Ejiga; Md Mahmudur Rahman; Ojonugwa Oluwafemi Ejiga Peter

arxiv: 2605.24792 · v1 · pith:Z4ZPQRZ2new · submitted 2026-05-24 · 💻 cs.CV · cs.AI

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

Pith reviewed 2026-06-30 12:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords parameter-efficient fine-tuning · visual question answering · synthetic medical images · gastrointestinal endoscopy · LoRA adaptation · Florence-2 · Stable Diffusion · privacy-preserving generation

The pith

Parameter-efficient fine-tuning enables high-scoring visual question answering and privacy-preserving synthetic image generation for gastrointestinal endoscopy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to overcome shortages of annotated data, privacy restrictions, and high fine-tuning costs that limit AI use in gastrointestinal endoscopy. It does so with a dual pipeline that applies parameter-efficient fine-tuning to the Florence-2 model for clinical visual question answering and to Stable Diffusion 2.1 via Low-Rank Adaptation for creating synthetic images. A reader would care because the approach promises to let hospitals train capable models on limited public data while generating privacy-preserving synthetic images and cutting training expense sharply. The work reports that the resulting VQA model reaches ROUGE-1 of 0.92 and that rank-4 LoRA synthesis yields the strongest image-text alignment among tested generators.

Core claim

The authors claim that a parameter-efficient dual pipeline, built on Florence-2 for visual question answering and rank-4 LoRA adaptation of Stable Diffusion 2.1 for image synthesis, delivers ROUGE-1 of 0.92, ROUGE-L of 0.91, BLEU improvement from 0.08 to 0.24, fidelity 0.290, agreement 0.730, and Frechet BiomedCLIP Distance of 1450 on the Kvasir-VQA dataset, reduces computational cost by nearly 90 percent, and produces better image-text coherence than FLUX, MSDM, and Kandinsky 2.2 while performing even better after fine-tuning on private data.
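To make the headline VQA metric concrete, the following is a minimal sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes whitespace tokenization and lowercasing; the paper's actual scoring implementation is not described in the abstract, and the function name `rouge1_f1` and the example answers are hypothetical, not drawn from Kvasir-VQA.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a model answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers for illustration only.
reference = "one polyp is visible in the image"
correct = "one polyp is visible in the image"
wrong = "no polyp is visible in the image"

score_correct = rouge1_f1(reference, correct)
score_wrong = rouge1_f1(reference, wrong)
```

In this toy case the clinically opposite answer still shares six of seven unigrams with the reference, so its score is 6/7, roughly 0.86. High lexical overlap does not by itself establish a clinically correct answer.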

What carries the argument

The dual-pipeline parameter-efficient fine-tuning framework that combines the Florence-2 vision-language model for visual question answering with Low-Rank Adaptation on Stable Diffusion 2.1 for medical image generation.
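The low-rank update at the heart of LoRA can be sketched in plain Python as follows. Only the rank of 4 comes from the paper; the toy matrix sizes, the scaling factor `alpha`, and the helper names `matmul` and `lora_effective_weight` are illustrative assumptions, and nothing here reproduces the authors' training code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)  # low-rank update with shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
d_out, d_in, rank = 6, 8, 4  # toy sizes; only rank = 4 comes from the paper

# Frozen pre-trained weight: never updated during adaptation.
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A starts small and random, B starts at zero.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_effective_weight(w, a, b, alpha=4.0)
```

Because `b` starts at zero, the adapted layer initially behaves exactly like the frozen one; training then moves only the two small factors.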

If this is right

Fine-tuning on private datasets produces better results than fine-tuning on public datasets alone.
The rank-4 LoRA setting gives the best combination of fidelity, agreement, and Frechet BiomedCLIP Distance among the ranks tested.
The generated images achieve lower Frechet BiomedCLIP Distance than those from FLUX, MSDM, and Kandinsky 2.2, showing stronger semantic alignment.
Training cost drops by almost 90 percent relative to standard fine-tuning while maintaining the reported metric levels.
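As a rough illustration of why low-rank adaptation is cheap, the sketch below counts trainable parameters for one weight matrix with and without LoRA. The 1024 x 1024 layer size is an assumption chosen for the arithmetic, not a figure from the paper, and the function name `lora_trainable_fraction` is hypothetical. The fraction of trainable parameters is also not the same quantity as the reported reduction in computational cost, since forward and backward passes still run through the frozen weights.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of a d_out x d_in layer's parameters that LoRA trains."""
    full_params = d_out * d_in  # parameters updated by full fine-tuning
    lora_params = rank * (d_in + d_out)  # A is (rank, d_in), B is (d_out, rank)
    return lora_params / full_params


# Illustrative layer size (assumption); rank 4 matches the paper's setting.
fraction = lora_trainable_fraction(d_out=1024, d_in=1024, rank=4)
```

For these illustrative sizes the fraction is 1/128, so under 1 percent of the layer's weights would be trained.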

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be tested on other endoscopy tasks such as polyp detection or lesion segmentation to check whether synthetic data helps there as well.
Hospitals without large public datasets might still obtain usable models by combining limited private data with the generated images.
If the synthetic images preserve enough clinical detail, they could reduce the need to share real patient images across institutions.
The observed advantage of private-data fine-tuning implies that deployment would likely require each site to run its own adaptation step.

Load-bearing premise

That the performance numbers measured on the public Kvasir-VQA dataset indicate real clinical usefulness, and that the synthetic images will improve downstream tasks without further checks on private clinical data.

What would settle it

Running the trained VQA model on a new private clinical endoscopy dataset and measuring whether adding the synthetic images raises accuracy would show whether the reported gains transfer outside the public test set.

Figures

Figures reproduced from arXiv: 2605.24792 by Fahmi Khalifa, Frederick Akor Ejiga, Md Mahmudur Rahman, Ojonugwa Oluwafemi Ejiga Peter.

**Figure 1.** Dual-pipeline architecture for PEFT gastrointestinal endoscopy AI system. Left: Florence-2-based VQA pipeline with frozen vision encoder and PEFT.

**Figure 2.** Performance comparison of VQA metrics across eight experiments

**Figure 3.** Visual Question Answering (VQA) example for gastrointestinal

**Figure 4.** Fidelity and Agreement scores across three experiments using public

**Figure 5.** Frechet BiomedCLIP Distance (FBD) scores. Lower values indicate

Original abstract

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.
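The abstract does not define FBD. Assuming it follows the usual Fréchet-distance construction used for FID, computed on BiomedCLIP image embeddings rather than Inception features, the score would take the form below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ denote the mean and covariance of embeddings of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FBD}^2
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Under this reading the distance is zero only when the two Gaussian fits coincide, which is why lower values are read as closer agreement between the real and generated image distributions.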

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies known PEFT and LoRA methods to Kvasir-VQA but keeps the VQA and synthesis pipelines separate with no test of whether the generated images raise downstream performance.

This paper takes existing parameter-efficient methods and applies them to visual question answering and image generation for gastrointestinal endoscopy on the Kvasir-VQA dataset. The core result is efficiency gains without new algorithms, but the experiments stop short of showing that the synthetic images help the VQA model.

It does a clean job of fine-tuning Florence-2 with PEFT for the question answering part and using rank-4 LoRA on Stable Diffusion 2.1 for generating images. The reported numbers include ROUGE-1 at 0.92 and a nearly 90 percent drop in computational cost for generation, plus comparisons on FBD against FLUX, MSDM, and Kandinsky 2.2. The setup is practical and the metrics are presented directly.

The soft spot is the lack of integration between the two parts. The VQA scores and the image generation scores are given separately, with no ablation that trains the VQA model on a mix of real and synthetic data to check for improvement. The abstract mentions better performance on private datasets, but without specifics on data exclusion or how the synthetic images factor in. Those high ROUGE and BLEU numbers on public data also need more context on what counts as a good answer in a clinical sense.

This work is aimed at people building AI tools for endoscopy who need to work around data limits and privacy rules. A reader looking for ready-to-adapt code patterns or baseline numbers on this dataset might find it useful.

I would send it to peer review. The application is clear, but referees should press for the missing experiments that tie the synthetic data to actual VQA gains.

Referee Report

3 major / 2 minor

Summary. The paper presents a dual-pipeline PEFT framework for GI endoscopy AI: (1) Florence-2 with parameter-efficient fine-tuning for clinical VQA on the Kvasir-VQA dataset, reporting ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU improvement from 0.08 to 0.24; (2) rank-4 LoRA on Stable Diffusion 2.1 for privacy-preserving synthetic image generation, achieving fidelity 0.290, agreement 0.730, and FBD 1450 while claiming ~90% compute reduction. It asserts that the approach addresses data scarcity and privacy, outperforms FLUX, MSDM, and Kandinsky 2.2 on FBD, and improves clinical potential, and it notes that fine-tuning on private datasets outperforms fine-tuning on public ones.

Significance. If the unshown link between synthetic images and improved VQA performance holds and generalizes beyond Kvasir-VQA, the work would offer a practical, compute-efficient route to dataset augmentation under privacy constraints. The explicit use of LoRA rank-4 and direct comparison of FBD against three other generators are concrete strengths; however, the absence of any ablation or downstream evaluation means the significance remains conditional on those missing experiments.

major comments (3)

[Abstract / Results] Abstract and results: the central claim that synthetic images 'enhance training databases' and improve clinical VQA is unsupported because no experiment trains the Florence-2 VQA model on real + synthetic data and reports the resulting ROUGE/BLEU change on a held-out set. The two pipelines are evaluated separately.
[Abstract] Abstract: the statement 'Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets' is presented without the corresponding table, split, or test-set details; the headline ROUGE/BLEU numbers are given only for Kvasir-VQA, leaving the clinical-utility claim without a private test-set evaluation.
[Abstract] Abstract: the reported FBD of 1450 is called 'optimal' and superior for image-text coherence, yet no ablation or human-expert rating demonstrates that images at this FBD level actually raise downstream VQA accuracy when added to training; the metric comparison to FLUX/MSDM/Kandinsky therefore does not yet establish utility for the stated goal.

minor comments (2)

[Abstract] Abstract: the phrase 'reducing computational costs by almost 90 percent' should be accompanied by the exact baseline (full fine-tuning FLOPs or wall-clock time) and the measured reduction for rank-4 LoRA.
[Abstract] Abstract: 'Fidelity score of 0.290' and 'agreement score of 0.730' are reported without definitions or references to the exact formulas or human-evaluation protocol used.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful comments. We address each major comment point by point below. We agree that several claims in the abstract require clarification or additional support, and we will make revisions to ensure the claims are accurately supported by the presented experiments.

Referee: [Abstract / Results] Abstract and results: the central claim that synthetic images 'enhance training databases' and improve clinical VQA is unsupported because no experiment trains the Florence-2 VQA model on real + synthetic data and reports the resulting ROUGE/BLEU change on a held-out set. The two pipelines are evaluated separately.

Authors: We acknowledge that the manuscript evaluates the VQA and synthetic image generation pipelines independently, without an experiment that combines synthetic images with real data to measure improvement in VQA metrics. The claim in the abstract that synthetic images enhance training databases is aspirational based on the generation quality, but not empirically demonstrated in this work. We will revise the abstract and introduction to remove or qualify this claim, stating that the generation pipeline provides a means to augment datasets while preserving privacy, with the impact on VQA left for future investigation.
revision: yes

Referee: [Abstract] Abstract: the statement 'Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets' is presented without the corresponding table, split, or test-set details; the headline ROUGE/BLEU numbers are given only for Kvasir-VQA, leaving the clinical-utility claim without a private test-set evaluation.

Authors: The statement regarding private datasets is based on additional experiments conducted on proprietary data. However, due to privacy constraints, detailed tables and splits were not included in the main manuscript. We will add a note or supplementary material with aggregated results or anonymized details to substantiate this claim, while ensuring compliance with data privacy.
revision: partial

Referee: [Abstract] Abstract: the reported FBD of 1450 is called 'optimal' and superior for image-text coherence, yet no ablation or human-expert rating demonstrates that images at this FBD level actually raise downstream VQA accuracy when added to training; the metric comparison to FLUX/MSDM/Kandinsky therefore does not yet establish utility for the stated goal.

Authors: The FBD metric is used to compare image-text coherence among generators, and our model achieves a lower FBD indicating better alignment. However, we agree that this does not directly translate to improved VQA performance without further experiments. We will revise the wording in the abstract to describe the FBD result as superior in terms of the metric rather than claiming it as 'optimal' for clinical VQA utility.
revision: yes

standing simulated objections not resolved

The lack of a combined experiment showing synthetic data improving VQA performance cannot be addressed without conducting new experiments, which are beyond the scope of a revision response.

Circularity Check

0 steps flagged

No circularity: empirical metrics reported directly from evaluations

The paper describes an applied dual-pipeline setup (Florence-2 VQA fine-tuned with PEFT; rank-4 LoRA on Stable Diffusion 2.1) evaluated on the public Kvasir-VQA dataset, with reported ROUGE/BLEU/FBD numbers presented as direct outputs of those runs. No equations, derivations, fitted-parameter predictions, or self-citation chains are invoked to justify a central claim; the results do not reduce to inputs by construction. Any potential self-citations would be non-load-bearing for the reported scores, which remain falsifiable external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond standard use of pre-trained models and the choice of LoRA rank 4. No new entities are postulated.

free parameters (1)

LoRA rank
Selected as optimal for the synthesis task; value 4 is stated without derivation from first principles.

