pith. sign in

arxiv: 2503.21840 · v1 · submitted 2025-03-27 · 📡 eess.IV · cs.CV

Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

Pith reviewed 2026-05-22 23:11 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords polyp detectioncolonoscopyvision language modelsconvolutional neural networksCADeCADxBioMedCLIPGPT-4
0
0 comments X

The pith

ResNet50 outperforms VLMs in polyp detection and classification on colonoscopy images, though BioMedCLIP and GPT-4 remain competitive in detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a direct comparison of eleven models on 2,258 colonoscopy images for two clinical tasks: polyp detection (CADe) and polyp classification (CADx). ResNet50 records the highest scores on both tasks, followed by BioMedCLIP in detection and GPT-4 ahead of other general VLMs. The authors conclude that CNNs deliver the strongest results when full supervised training is possible, yet certain VLMs may still be practical when that training cannot be performed. This establishes a performance hierarchy and identifies limited but usable roles for vision-language models in settings with restricted training resources.

Core claim

On the same set of preprocessed colonoscopy images, ResNet50 reached an F1 of 91.35 percent and AUROC of 0.98 for polyp detection and a weighted F1 of 74.94 percent for classification, while BioMedCLIP achieved 88.68 percent F1 in detection and GPT-4 reached 81.02 percent F1 in detection and 41.18 percent weighted F1 in classification, with other VLMs performing lower.

What carries the argument

The standardized comparative framework that applies identical preprocessing and the same detection and classification metrics to ResNet50, four classic machine-learning classifiers, CLIP, BioMedCLIP, and three general-purpose VLMs.

If this is right

  • CNNs such as ResNet50 remain the most accurate option for both polyp detection and classification when supervised training data are available.
  • BioMedCLIP can reach detection performance close to ResNet50 without task-specific fine-tuning.
  • GPT-4 exceeds other general VLMs in both tasks but still trails CNNs and BioMedCLIP.
  • When full CNN training is infeasible, BioMedCLIP or GPT-4 can supply usable detection results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The performance gap between detection and classification suggests VLMs may need domain-specific adaptation to handle fine-grained polyp typing.
  • VLMs could serve as an initial screening layer that flags images for later review by a trained CNN when labeled data are scarce.
  • The current results imply that future VLM improvements in medical imaging may reduce reliance on large annotated datasets for basic detection tasks.

Load-bearing premise

The zero-shot or few-shot prompting used for the VLMs is equivalent in data usage and training effort to the supervised training performed on ResNet50 and the classic machine-learning models.

What would settle it

A re-evaluation in which every model, including the VLMs, is trained or prompted with exactly the same quantity of labeled images and the same compute budget, then measured on the identical test set.

read the original abstract

Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs ( GPT-4 Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx). Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1] ). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2] ), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%). Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a comparative study of ResNet50, four classic machine learning models, two contrastive vision-language encoders (CLIP, BioMedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) on polyp detection (CADe) and classification (CADx) tasks using a dataset of 2,258 colonoscopy images from 428 patients. It reports that ResNet50 achieves the highest performance (F1 91.35% for detection, weighted F1 74.94% for classification), with BioMedCLIP and GPT-4 following, and concludes that CNNs are superior but VLMs may be useful when CNN training is infeasible.

Significance. If the evaluation protocols are shown to be comparable, this work provides empirical evidence on the relative strengths of CNNs versus VLMs in medical image analysis for colonoscopy, highlighting potential practical applications of VLMs in resource-constrained settings. The inclusion of multiple model types and two clinical tasks adds breadth to the comparison.

major comments (2)
  1. [Abstract/Methods] Abstract/Methods: The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.
  2. [Results] Results: No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.
minor comments (2)
  1. [Abstract] Abstract: Placeholders [AS1] and [AS2] appear in the reported AUROC values for BioMedCLIP and GPT-4, indicating incomplete reporting.
  2. [Abstract] Abstract: The conclusion states VLMs 'may be useful' without quantifying the conditions under which CNN training is 'not feasible'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important gaps in methodological transparency. We will revise the manuscript to provide the requested details on VLM evaluation protocols and experimental design elements. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract/Methods] Abstract/Methods: The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.

    Authors: We agree the protocol was insufficiently described. The general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) and contrastive encoders (CLIP, BioMedCLIP) were evaluated strictly in a zero-shot setting using carefully designed prompts; no fine-tuning or few-shot examples from the 2,258-image dataset were used. This design choice was deliberate to reflect realistic deployment scenarios where task-specific training data may be unavailable. We will expand the Methods section (and update the abstract) to state this explicitly, including the exact prompts employed, so that the comparison conditions are transparent. revision: yes

  2. Referee: [Results] Results: No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.

    Authors: We acknowledge these omissions. The dataset was partitioned at the patient level (80/20 train/test) to prevent leakage across images from the same patient, with class proportions preserved. Class imbalance was mitigated via class-weighted loss for the CNN and CML models; the zero-shot VLMs received no such adjustment. We will add these details to the Methods section and, in Results, report 95% bootstrap confidence intervals for all F1 and AUROC values together with pairwise statistical comparisons (McNemar’s test for detection, Friedman test with post-hoc for classification). revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on fixed dataset

full rationale

The paper reports direct performance metrics (F1, AUROC, weighted F1) from training ResNet50 and CMLs on the 2,258-image set and evaluating VLMs within a comparative framework. No equations, derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the abstract or described framework. All claims rest on measured outcomes rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical benchmark study; no free parameters, new axioms, or invented entities are introduced beyond standard machine-learning evaluation assumptions.

axioms (1)
  • domain assumption Images and pathology labels form an i.i.d. sample suitable for supervised evaluation
    The study treats the 2,258 images as a fixed dataset for model comparison without discussing distribution shift or label noise.

pith-pipeline@v0.9.0 · 6003 in / 1238 out tokens · 87299 ms · 2026-05-22T23:11:32.450299+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

  1. [1]

    M., van Oijen, M

    Leufkens, A. M., van Oijen, M. G. H., Vleggaar, F. P. & Siersema, P. D. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 44 , 470–475 (2012)

  2. [2]

    Kim, N. H. et al. Miss rate of colorectal neoplastic polyps and risk factors for missed polyps in consecutive colonoscopies. Intest Res 15 , 411–418 (2017)

  3. [3]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 , (2014)

  4. [4]

    Szegedy, C. et al. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2014) doi:10.48550/arXiv.1409.4842

  5. [5]

    Sunae So and Trevon Badloe and Jaebum Noh and Jorge Bravo-Abad and Junsuk Rho

    He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016). doi:10.1109/CVPR.2016.90

  6. [6]

    In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

    G. Huang, Z. Liu, L. Van Der Maaten, & K. Q. Weinberger. Densely Connected Convolutional Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (2017). doi:10.1109/CVPR.2017.243. 24

  7. [7]

    Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. (2021)

  8. [8]

    GPT-4 Technical Report

    OpenAI, Achiam, J., Adler, S., & others. GPT-4 Technical Report. (2024)

  9. [9]

    in (2024)

    The Claude 3 Model Family: Opus, Sonnet, Haiku. in (2024)

  10. [10]

    Team, G. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. (2024)

  11. [11]

    Pillai, A., Parappally, Bs. S. & Hardin, M. J. Evaluating the Diagnostic and Treatment Recommendation Capabilities of GPT-4 Vision in Dermatology. in medRxiv (2024). doi:10.1101/2024.01.24.24301743

  12. [12]

    & Apornvirat, S

    Laohawetwanit, T., Namboonlue, C. & Apornvirat, S. Accuracy of GPT-4 in histopathological image detection and classification of colorectal adenomas. J Clin Pathol jcp-2023-209304 (2024) doi:10.1136/jcp-2023-209304

  13. [13]

    Chen, R. et al. GPT-4 Vision on Medical Image Classification - A Case Study on COVID-19 Dataset. ArXiv abs/2310.18498 , (2023)

  14. [14]

    Han, T. et al. Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise. medRxiv 2023.11.03.23297957 (2023) doi:10.1101/2023.11.03.23297957

  15. [15]

    & Shi, D

    Xu, P., Chen, X., Zhao, Z. & Shi, D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 108 , 1384–1389 (2024)

  16. [16]

    Yang, Ms. Z. et al. Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations. in medRxiv (2023). 25 doi:10.1101/2023.10.26.23297629

  17. [17]

    Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine 7 , 190 (2024)

  18. [18]

    & Emam, K

    Klement, W. & Emam, K. E. Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Modeling Studies: Development and Validation. Journal of Medical Internet Research 25 , e48763 (2023)

  19. [19]

    Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385 , e078378 (2024)

  20. [20]

    & Vieth, M

    Haumaier, F., Sterlacci, W. & Vieth, M. Histological and molecular classification of gastrointestinal polyps. Best Pract Res Clin Gastroenterol 31 , 369–379 (2017)

  21. [21]

    Zhang, S. et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. (2025)

  22. [22]

    GPT-4V(ision) System Card

    OpenAI. GPT-4V(ision) System Card. in (2023)

  23. [23]

    Schmidl, B. et al. Assessing the use of the novel tool Claude 3 in comparison to ChatGPT 4.0 as an artificial intelligence tool in the diagnosis and therapy of primary head and neck cancer cases. European Archives of Oto-Rhino-Laryngology 281 , 6099–6109 (2024)

  24. [24]

    & Badawy, M

    Nguyen, C., Carrion, D. & Badawy, M. Comparative Performance of Claude and GPT Models in Basic Radiological Imaging Tasks. medRxiv (2024) doi:10.1101/2024.11.16.24317414

  25. [25]

    Ishida, M. et al. Diagnostic Performance of GPT-4o and Claude 3 Opus in Determining Causes of Death From Medical Histories and Postmortem CT Findings. Cureus 16 , e67306 26 (2024)

  26. [26]

    Liu, X. et al. Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis. JMIR Med Inform 12 , e59273 (2024)

  27. [27]

    Liu, M. et al. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study. medRxiv (2024) doi:10.1101/2024.07.09.24310129

  28. [28]

    Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

    Chen, Z. et al. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine 87 , 1041–1049 (2025). 27 Conflict of Interests Declaration AlSo serves on the advisory board and holds equity in Virgo Surgical Solutions. The other authors declare no conflicts of interest. Acknowledgments The ...