Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
Pith reviewed 2026-05-22 23:11 UTC · model grok-4.3
The pith
ResNet50 outperforms VLMs in polyp detection and classification on colonoscopy images, though BioMedCLIP and GPT-4 remain competitive in detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the same set of preprocessed colonoscopy images, ResNet50 reached an F1 of 91.35 percent and AUROC of 0.98 for polyp detection and a weighted F1 of 74.94 percent for classification, while BioMedCLIP achieved 88.68 percent F1 in detection and GPT-4 reached 81.02 percent F1 in detection and 41.18 percent weighted F1 in classification, with other VLMs performing lower.
What carries the argument
The standardized comparative framework that applies identical preprocessing and the same detection and classification metrics to ResNet50, four classic machine-learning classifiers, CLIP, BioMedCLIP, and three general-purpose VLMs.
If this is right
- CNNs such as ResNet50 remain the most accurate option for both polyp detection and classification when supervised training data are available.
- BioMedCLIP can reach detection performance close to ResNet50 without task-specific fine-tuning.
- GPT-4 exceeds other general VLMs in both tasks but still trails CNNs and BioMedCLIP.
- When full CNN training is infeasible, BioMedCLIP or GPT-4 can supply usable detection results.
Where Pith is reading between the lines
- The performance gap between detection and classification suggests VLMs may need domain-specific adaptation to handle fine-grained polyp typing.
- VLMs could serve as an initial screening layer that flags images for later review by a trained CNN when labeled data are scarce.
- The current results imply that future VLM improvements in medical imaging may reduce reliance on large annotated datasets for basic detection tasks.
Load-bearing premise
The zero-shot or few-shot prompting used for the VLMs is equivalent in data usage and training effort to the supervised training performed on ResNet50 and the classic machine-learning models.
What would settle it
A re-evaluation in which every model, including the VLMs, is trained or prompted with exactly the same quantity of labeled images and the same compute budget, then measured on the identical test set.
read the original abstract
Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs ( GPT-4 Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1] ). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2] ), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a comparative study of ResNet50, four classic machine learning models, two contrastive vision-language encoders (CLIP, BioMedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) on polyp detection (CADe) and classification (CADx) tasks using a dataset of 2,258 colonoscopy images from 428 patients. It reports that ResNet50 achieves the highest performance (F1 91.35% for detection, weighted F1 74.94% for classification), with BioMedCLIP and GPT-4 following, and concludes that CNNs are superior but VLMs may be useful when CNN training is infeasible.
Significance. If the evaluation protocols are shown to be comparable, this work provides empirical evidence on the relative strengths of CNNs versus VLMs in medical image analysis for colonoscopy, highlighting potential practical applications of VLMs in resource-constrained settings. The inclusion of multiple model types and two clinical tasks adds breadth to the comparison.
major comments (2)
- [Abstract/Methods] Abstract/Methods: The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.
- [Results] Results: No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.
minor comments (2)
- [Abstract] Abstract: Placeholders [AS1] and [AS2] appear in the reported AUROC values for BioMedCLIP and GPT-4, indicating incomplete reporting.
- [Abstract] Abstract: The conclusion states VLMs 'may be useful' without quantifying the conditions under which CNN training is 'not feasible'.
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which highlight important gaps in methodological transparency. We will revise the manuscript to provide the requested details on VLM evaluation protocols and experimental design elements. Below we respond point by point.
read point-by-point responses
-
Referee: [Abstract/Methods] Abstract/Methods: The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.
Authors: We agree the protocol was insufficiently described. The general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) and contrastive encoders (CLIP, BioMedCLIP) were evaluated strictly in a zero-shot setting using carefully designed prompts; no fine-tuning or few-shot examples from the 2,258-image dataset were used. This design choice was deliberate to reflect realistic deployment scenarios where task-specific training data may be unavailable. We will expand the Methods section (and update the abstract) to state this explicitly, including the exact prompts employed, so that the comparison conditions are transparent. revision: yes
-
Referee: [Results] Results: No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.
Authors: We acknowledge these omissions. The dataset was partitioned at the patient level (80/20 train/test) to prevent leakage across images from the same patient, with class proportions preserved. Class imbalance was mitigated via class-weighted loss for the CNN and CML models; the zero-shot VLMs received no such adjustment. We will add these details to the Methods section and, in Results, report 95% bootstrap confidence intervals for all F1 and AUROC values together with pairwise statistical comparisons (McNemar’s test for detection, Friedman test with post-hoc for classification). revision: yes
Circularity Check
No circularity: purely empirical model comparison on fixed dataset
full rationale
The paper reports direct performance metrics (F1, AUROC, weighted F1) from training ResNet50 and CMLs on the 2,258-image set and evaluating VLMs within a comparative framework. No equations, derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the abstract or described framework. All claims rest on measured outcomes rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Images and pathology labels form an i.i.d. sample suitable for supervised evaluation
Reference graph
Works this paper leans on
-
[1]
Leufkens, A. M., van Oijen, M. G. H., Vleggaar, F. P. & Siersema, P. D. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 44 , 470–475 (2012)
work page 2012
-
[2]
Kim, N. H. et al. Miss rate of colorectal neoplastic polyps and risk factors for missed polyps in consecutive colonoscopies. Intest Res 15 , 411–418 (2017)
work page 2017
-
[3]
Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 , (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[4]
Szegedy, C. et al. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2014) doi:10.48550/arXiv.1409.4842
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1409.4842 2015
-
[5]
Sunae So and Trevon Badloe and Jaebum Noh and Jorge Bravo-Abad and Junsuk Rho
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016). doi:10.1109/CVPR.2016.90
-
[6]
G. Huang, Z. Liu, L. Van Der Maaten, & K. Q. Weinberger. Densely Connected Convolutional Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (2017). doi:10.1109/CVPR.2017.243. 24
-
[7]
Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. (2021)
work page 2021
-
[8]
OpenAI, Achiam, J., Adler, S., & others. GPT-4 Technical Report. (2024)
work page 2024
- [9]
-
[10]
Team, G. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. (2024)
work page 2024
-
[11]
Pillai, A., Parappally, Bs. S. & Hardin, M. J. Evaluating the Diagnostic and Treatment Recommendation Capabilities of GPT-4 Vision in Dermatology. in medRxiv (2024). doi:10.1101/2024.01.24.24301743
-
[12]
Laohawetwanit, T., Namboonlue, C. & Apornvirat, S. Accuracy of GPT-4 in histopathological image detection and classification of colorectal adenomas. J Clin Pathol jcp-2023-209304 (2024) doi:10.1136/jcp-2023-209304
- [13]
-
[14]
Han, T. et al. Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise. medRxiv 2023.11.03.23297957 (2023) doi:10.1101/2023.11.03.23297957
- [15]
-
[16]
Yang, Ms. Z. et al. Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations. in medRxiv (2023). 25 doi:10.1101/2023.10.26.23297629
-
[17]
Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine 7 , 190 (2024)
work page 2024
- [18]
-
[19]
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385 , e078378 (2024)
work page 2024
-
[20]
Haumaier, F., Sterlacci, W. & Vieth, M. Histological and molecular classification of gastrointestinal polyps. Best Pract Res Clin Gastroenterol 31 , 369–379 (2017)
work page 2017
-
[21]
Zhang, S. et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. (2025)
work page 2025
- [22]
-
[23]
Schmidl, B. et al. Assessing the use of the novel tool Claude 3 in comparison to ChatGPT 4.0 as an artificial intelligence tool in the diagnosis and therapy of primary head and neck cancer cases. European Archives of Oto-Rhino-Laryngology 281 , 6099–6109 (2024)
work page 2024
-
[24]
Nguyen, C., Carrion, D. & Badawy, M. Comparative Performance of Claude and GPT Models in Basic Radiological Imaging Tasks. medRxiv (2024) doi:10.1101/2024.11.16.24317414
-
[25]
Ishida, M. et al. Diagnostic Performance of GPT-4o and Claude 3 Opus in Determining Causes of Death From Medical Histories and Postmortem CT Findings. Cureus 16 , e67306 26 (2024)
work page 2024
-
[26]
Liu, X. et al. Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis. JMIR Med Inform 12 , e59273 (2024)
work page 2024
-
[27]
Liu, M. et al. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study. medRxiv (2024) doi:10.1101/2024.07.09.24310129
-
[28]
Chen, Z. et al. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine 87 , 1041–1049 (2025). 27 Conflict of Interests Declaration AlSo serves on the advisory board and holds equity in Virgo Surgical Solutions. The other authors declare no conflicts of interest. Acknowledgments The ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.