Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images

Mohammad Amin Khalafi, Seyed Amir Ahmad Safavi-Naini, Ameneh Salehi, Nariman Naderi, Dorsa Alijanzadeh, Pardis Ketabi Moghadam, Kaveh Kavosi, Negar Golestani, Shabnam Shahrokh, Soltanali Fallah, Jamil S Samaan, Nicholas P. Tatonetti, Nicholas Hoerter, Girish Nadkarni, Hamid Asadzadeh Aghdaei, Ali Soroush

arxiv: 2503.21840 · v1 · submitted 2025-03-27 · eess.IV · cs.CV

Pith reviewed 2026-05-22 23:11 UTC · model grok-4.3

keywords: polyp detection, colonoscopy, vision language models, convolutional neural networks, CADe, CADx, BioMedCLIP, GPT-4

The pith

ResNet50 outperforms VLMs in polyp detection and classification on colonoscopy images, though BioMedCLIP and GPT-4 remain competitive in detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a direct comparison of eleven models on 2,258 colonoscopy images for two clinical tasks: polyp detection (CADe) and polyp classification (CADx). ResNet50 records the highest scores on both tasks, followed by BioMedCLIP in detection and GPT-4 ahead of other general VLMs. The authors conclude that CNNs deliver the strongest results when full supervised training is possible, yet certain VLMs may still be practical when that training cannot be performed. This establishes a performance hierarchy and identifies limited but usable roles for vision-language models in settings with restricted training resources.

Core claim

On the same set of preprocessed colonoscopy images, ResNet50 reached an F1 of 91.35 percent and AUROC of 0.98 for polyp detection and a weighted F1 of 74.94 percent for classification, while BioMedCLIP achieved 88.68 percent F1 in detection and GPT-4 reached 81.02 percent F1 in detection and 41.18 percent weighted F1 in classification, with other VLMs performing lower.
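The detection and classification scores above are F1 and support-weighted F1 values. As a minimal sketch of how such figures are computed, the snippet below derives per-class F1 and the weighted average from label lists. The class names and toy labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, positive):
    """F1 for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Hypothetical toy labels, for illustration only.
y_true = ["adenoma", "adenoma", "hyperplastic", "serrated", "adenoma", "hyperplastic"]
y_pred = ["adenoma", "hyperplastic", "hyperplastic", "adenoma", "adenoma", "hyperplastic"]

score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.2%}")
```

Because the weights follow class frequency, a model that ignores a rare class can still post a moderate weighted F1, which is worth keeping in mind when reading the classification numbers.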

What carries the argument

The standardized comparative framework that applies identical preprocessing and the same detection and classification metrics to ResNet50, four classic machine-learning classifiers, CLIP, BioMedCLIP, and three general-purpose VLMs.

If this is right

CNNs such as ResNet50 remain the most accurate option for both polyp detection and classification when supervised training data are available.
BioMedCLIP can reach detection performance close to ResNet50 without task-specific fine-tuning.
GPT-4 exceeds other general VLMs in both tasks but still trails CNNs and BioMedCLIP.
When full CNN training is infeasible, BioMedCLIP or GPT-4 can supply usable detection results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The performance gap between detection and classification suggests VLMs may need domain-specific adaptation to handle fine-grained polyp typing.
VLMs could serve as an initial screening layer that flags images for later review by a trained CNN when labeled data are scarce.
The current results imply that future VLM improvements in medical imaging may reduce reliance on large annotated datasets for basic detection tasks.

Load-bearing premise

The zero-shot or few-shot prompting used for the VLMs is equivalent in data usage and training effort to the supervised training performed on ResNet50 and the classic machine-learning models.

What would settle it

A re-evaluation in which every model, including the VLMs, is trained or prompted with exactly the same quantity of labeled images and the same compute budget, then measured on the identical test set.

Original abstract

Introduction: This study provides a comprehensive performance assessment of vision-language models (VLMs) against established convolutional neural networks (CNNs) and classic machine learning models (CMLs) for computer-aided detection (CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images.

Method: We analyzed 2,258 colonoscopy images with corresponding pathology reports from 428 patients. We preprocessed all images using standardized techniques (resizing, normalization, and augmentation) and implemented a rigorous comparative framework evaluating 11 distinct models: ResNet50, 4 CMLs (random forest, support vector machine, logistic regression, decision tree), two specialized contrastive vision language encoders (CLIP, BiomedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance assessment focused on two clinical tasks: polyp detection (CADe) and classification (CADx).

Result: In polyp detection, ResNet50 achieved the best performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%, AUROC: [AS1]). GPT-4 demonstrated comparable effectiveness to traditional machine learning approaches (F1: 81.02%, AUROC: [AS2]), outperforming other general-purpose VLMs. For polyp classification, performance rankings remained consistent but with lower overall metrics. ResNet50 maintained the highest efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability (weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus weighted F1: 25.54%, Gemini 1.5 Pro weighted F1: 6.17%).

Conclusion: CNNs remain superior for both CADx and CADe tasks. However, VLMs like BioMedCLIP and GPT-4 may be useful for polyp detection tasks where training CNNs is not feasible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's claim that CNNs are superior rests on an unverified assumption that VLMs received comparable training data and effort.

ResNet50 beats the VLMs on polyp detection and classification in this 2,258-image set, but the results don't establish that the gap comes from model class rather than unequal access to training data. The abstract reports ResNet50 trained on the full collection while describing the VLMs only as part of a comparative framework, without stating zero-shot, few-shot, or fine-tuning details. That leaves the central comparison hard to interpret.

BiomedCLIP reaches 88.68% F1 on detection and GPT-4 hits 81%, both behind ResNet50's 91.35%, with larger gaps on the classification task. The numbers themselves are concrete and the dataset size from 428 patients is reasonable for this domain. The work also includes both specialized encoders and general-purpose models, which gives a slightly wider view than single-VLM studies.

The main limitation is the missing protocol information. Without knowing exactly how much task-specific data or optimization the VLMs saw, the conclusion that CNNs remain superior and VLMs are only backups when training is impossible does not follow from the evidence shown. Standard details like train-test splits, cross-validation, or handling of imbalance are also absent, which weakens confidence in the reported metrics.

This is a narrow benchmark paper. Readers running medical imaging experiments might find the raw scores useful for quick reference, but it adds no new method or theoretical point. I would send it for peer review if the methods section supplies the VLM evaluation setup and basic experimental controls; otherwise the main claim stays under-supported.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a comparative study of ResNet50, four classic machine learning models, two contrastive vision-language encoders (CLIP, BioMedCLIP), and three general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) on polyp detection (CADe) and classification (CADx) tasks using a dataset of 2,258 colonoscopy images from 428 patients. It reports that ResNet50 achieves the highest performance (F1 91.35% for detection, weighted F1 74.94% for classification), with BioMedCLIP and GPT-4 following, and concludes that CNNs are superior but VLMs may be useful when CNN training is infeasible.

Significance. If the evaluation protocols are shown to be comparable, this work provides empirical evidence on the relative strengths of CNNs versus VLMs in medical image analysis for colonoscopy, highlighting potential practical applications of VLMs in resource-constrained settings. The inclusion of multiple model types and two clinical tasks adds breadth to the comparison.

major comments (2)

[Abstract/Methods] The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.
[Results] No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.
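To make the request for confidence intervals concrete, here is a minimal sketch of a percentile bootstrap interval for a binary F1 score. It is an illustration under stated assumptions, not the authors' procedure: the function names, the number of resamples, and the toy predictions are all hypothetical, and resampling is done per image (a per-patient resample may be more appropriate when one patient contributes several images).

```python
import random


def binary_f1(y_true, y_pred):
    """F1 for binary labels where 1 means 'polyp present'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy predictions: 20 polyp images, 20 normal images.
y_true = [1] * 20 + [0] * 20
y_pred = [1] * 17 + [0] * 3 + [1] * 4 + [0] * 16

point = binary_f1(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, binary_f1)
print(f"F1 = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With only a few hundred test images, intervals of this kind can be wide enough that adjacent models in a ranking overlap, which is why the referee asks for them.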

minor comments (2)

[Abstract] Placeholders [AS1] and [AS2] appear in the reported AUROC values for BioMedCLIP and GPT-4, indicating incomplete reporting.
[Abstract] The conclusion states VLMs 'may be useful' without quantifying the conditions under which CNN training is 'not feasible'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important gaps in methodological transparency. We will revise the manuscript to provide the requested details on VLM evaluation protocols and experimental design elements. Below we respond point by point.

Referee: [Abstract/Methods] The abstract and framework description do not specify the evaluation protocol for the VLMs (e.g., zero-shot prompting, few-shot examples, or fine-tuning on the 2,258-image dataset or its splits), unlike the explicit training of ResNet50. This omission prevents verification that performance differences reflect model capabilities rather than unequal access to task-specific training data, directly undermining the central claim of CNN superiority under matched conditions.

Authors: We agree the protocol was insufficiently described. The general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus) and contrastive encoders (CLIP, BioMedCLIP) were evaluated strictly in a zero-shot setting using carefully designed prompts; no fine-tuning or few-shot examples from the 2,258-image dataset were used. This design choice was deliberate to reflect realistic deployment scenarios where task-specific training data may be unavailable. We will expand the Methods section (and update the abstract) to state this explicitly, including the exact prompts employed, so that the comparison conditions are transparent. (Revision: yes)
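For readers unfamiliar with zero-shot use of a contrastive encoder such as CLIP or BiomedCLIP, the sketch below shows the general mechanism: an image embedding is compared with the embeddings of one text prompt per class, and the closest prompt wins. Everything here is an assumption for illustration. The three-dimensional vectors stand in for real encoder outputs, the prompt labels are hypothetical, and the temperature value is arbitrary; the actual prompts used in the study are not given in this text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, prompt_embs, temperature=0.01):
    """Pick the class whose prompt embedding is most similar to the image.

    Returns the predicted label and a softmax over scaled similarities,
    which can serve as a score for threshold-based metrics such as AUROC.
    """
    sims = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp((s - top) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


# Hypothetical embeddings; real ones would come from the image and text encoders.
prompt_embs = {
    "polyp": [0.9, 0.1, 0.2],
    "no polyp": [0.1, 0.9, 0.3],
}
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_predict(image_emb, prompt_embs)
print(label, probs)
```

No labeled colonoscopy image enters this procedure, which is the asymmetry with the supervised ResNet50 that the referee's comment is concerned with.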
Referee: [Results] No details are provided on train/test splits, handling of class imbalance, or statistical testing (e.g., confidence intervals or p-values) for the reported metrics such as F1 scores and AUROCs, making it difficult to assess the reliability and generalizability of the performance rankings.

Authors: We acknowledge these omissions. The dataset was partitioned at the patient level (80/20 train/test) to prevent leakage across images from the same patient, with class proportions preserved. Class imbalance was mitigated via class-weighted loss for the CNN and CML models; the zero-shot VLMs received no such adjustment. We will add these details to the Methods section and, in Results, report 95% bootstrap confidence intervals for all F1 and AUROC values together with pairwise statistical comparisons (McNemar's test for detection, Friedman test with post-hoc for classification). (Revision: yes)
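Two of the procedures named in this response can be sketched briefly: a patient-level split, where every image from a given patient lands on the same side, and an exact two-sided McNemar test comparing two detectors on the same test images. This is a simplified illustration under stated assumptions. The function names and toy inputs are hypothetical, and the split shown here does not stratify by class, unlike the class-preserving partition the simulated response describes.

```python
import random
from math import comb


def patient_level_split(image_patient_ids, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient spans both."""
    patients = sorted(set(image_patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(image_patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(image_patient_ids) if p in test_patients]
    return train_idx, test_idx


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar p-value from the discordant pairs only."""
    triples = list(zip(y_true, pred_a, pred_b))
    only_a = sum(1 for t, a, b in triples if a == t and b != t)  # A right, B wrong
    only_b = sum(1 for t, a, b in triples if a != t and b == t)  # B right, A wrong
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical image-to-patient mapping (10 patients, 15 images).
ids = ["p01", "p01", "p02", "p03", "p03", "p03", "p04", "p05",
       "p06", "p06", "p07", "p08", "p09", "p10", "p10"]
train_idx, test_idx = patient_level_split(ids)

# Hypothetical predictions: model A is right on 5 cases where model B is wrong.
p_value = mcnemar_exact([1] * 5, [1] * 5, [0] * 5)
print(len(train_idx), len(test_idx), p_value)
```

McNemar's test ignores cases where both models are right or both are wrong, so it needs the paired per-image predictions rather than the summary F1 values.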

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on fixed dataset

The paper reports direct performance metrics (F1, AUROC, weighted F1) from training ResNet50 and CMLs on the 2,258-image set and evaluating VLMs within a comparative framework. No equations, derivations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the abstract or described framework. All claims rest on measured outcomes rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical benchmark study; no free parameters, new axioms, or invented entities are introduced beyond standard machine-learning evaluation assumptions.

axioms (1)

Domain assumption: Images and pathology labels form an i.i.d. sample suitable for supervised evaluation.
The study treats the 2,258 images as a fixed dataset for model comparison without discussing distribution shift or label noise.

pith-pipeline@v0.9.0 · 6003 in / 1238 out tokens · 87299 ms · 2026-05-22T23:11:32.450299+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

[1] Leufkens, A. M., van Oijen, M. G. H., Vleggaar, F. P. & Siersema, P. D. Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 44, 470–475 (2012).
[2] Kim, N. H. et al. Miss rate of colorectal neoplastic polyps and risk factors for missed polyps in consecutive colonoscopies. Intest Res 15, 411–418 (2017).
[3] Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[4] Szegedy, C. et al. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1–9 (2014). doi:10.48550/arXiv.1409.4842
[5] He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (2016). doi:10.1109/CVPR.2016.90
[6] Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely Connected Convolutional Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (2017). doi:10.1109/CVPR.2017.243
[7] Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. (2021).
[8] OpenAI, Achiam, J., Adler, S., & others. GPT-4 Technical Report. (2024).
[9] The Claude 3 Model Family: Opus, Sonnet, Haiku. (2024).
[10] Team, G. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. (2024).
[11] Pillai, A., Parappally, Bs. S. & Hardin, M. J. Evaluating the Diagnostic and Treatment Recommendation Capabilities of GPT-4 Vision in Dermatology. medRxiv (2024). doi:10.1101/2024.01.24.24301743
[12] Laohawetwanit, T., Namboonlue, C. & Apornvirat, S. Accuracy of GPT-4 in histopathological image detection and classification of colorectal adenomas. J Clin Pathol jcp-2023-209304 (2024). doi:10.1136/jcp-2023-209304
[13] Chen, R. et al. GPT-4 Vision on Medical Image Classification - A Case Study on COVID-19 Dataset. ArXiv abs/2310.18498 (2023).
[14] Han, T. et al. Comparative Analysis of GPT-4Vision, GPT-4 and Open Source LLMs in Clinical Diagnostic Accuracy: A Benchmark Against Human Expertise. medRxiv 2023.11.03.23297957 (2023). doi:10.1101/2023.11.03.23297957
[15] Xu, P., Chen, X., Zhao, Z. & Shi, D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 108, 1384–1389 (2024).
[16] Yang, Ms. Z. et al. Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations. medRxiv (2023). doi:10.1101/2023.10.26.23297629
[17] Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine 7, 190 (2024).
[18] Klement, W. & Emam, K. E. Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Modeling Studies: Development and Validation. Journal of Medical Internet Research 25, e48763 (2023).
[19] Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
[20] Haumaier, F., Sterlacci, W. & Vieth, M. Histological and molecular classification of gastrointestinal polyps. Best Pract Res Clin Gastroenterol 31, 369–379 (2017).
[21] Zhang, S. et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. (2025).
[22] OpenAI. GPT-4V(ision) System Card. (2023).
[23] Schmidl, B. et al. Assessing the use of the novel tool Claude 3 in comparison to ChatGPT 4.0 as an artificial intelligence tool in the diagnosis and therapy of primary head and neck cancer cases. European Archives of Oto-Rhino-Laryngology 281, 6099–6109 (2024).
[24] Nguyen, C., Carrion, D. & Badawy, M. Comparative Performance of Claude and GPT Models in Basic Radiological Imaging Tasks. medRxiv (2024). doi:10.1101/2024.11.16.24317414
[25] Ishida, M. et al. Diagnostic Performance of GPT-4o and Claude 3 Opus in Determining Causes of Death From Medical Histories and Postmortem CT Findings. Cureus 16, e67306 (2024).
[26] Liu, X. et al. Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis. JMIR Med Inform 12, e59273 (2024).
[27] Liu, M. et al. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study. medRxiv (2024). doi:10.1101/2024.07.09.24310129
[28] Chen, Z. et al. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine 87, 1041–1049 (2025).

Conflict of Interests Declaration: AlSo serves on the advisory board and holds equity in Virgo Surgical Solutions. The other authors declare no conflicts of interest.

Acknowledgments: The ...
