Recognition: 2 Lean theorem links
MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
Pith reviewed 2026-05-16 19:43 UTC · model grok-4.3
The pith
A fine-tuned open-source MedGemma model outperforms untuned GPT-4 at classifying six diseases from medical images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), achieves a mean test accuracy of 80.37 percent in classifying six diseases from medical images, outperforming the untuned GPT-4 at 69.58 percent. It also shows higher sensitivity on high-stakes tasks, including cancer and pneumonia detection. Quantitative analysis through confusion matrices and classification reports supplies detailed per-category performance insights. From these outcomes the authors conclude that domain-specific fine-tuning is essential for minimizing hallucinations and supporting evidence-based medical reasoning.
What carries the argument
LoRA fine-tuning applied to the MedGemma-4b-it multimodal model, contrasted with direct use of untuned GPT-4, measured by accuracy, sensitivity, and per-class metrics on a fixed test set of images.
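To make that evaluation machinery concrete, here is a minimal sketch of how accuracy, per-class sensitivity, confusion matrices, and classification reports could be computed with scikit-learn. The disease labels and the toy predictions are illustrative assumptions; the paper's exact label set and its parsing of model outputs are not given in the text above.

```python
# Sketch of the reported evaluation: overall accuracy, per-class
# sensitivity (recall), a confusion matrix, and a classification report.
# Labels and predictions below are toy placeholders, not the paper's data.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, recall_score)

classes = ["skin_cancer", "alzheimers", "breast_cancer",
           "arrhythmia", "pneumonia", "kidney_disease"]  # assumed names

y_true = ["pneumonia", "pneumonia", "skin_cancer", "arrhythmia"]
y_pred = ["pneumonia", "kidney_disease", "skin_cancer", "arrhythmia"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Sensitivity for a class is recall restricted to that class.
print("sensitivity:", recall_score(y_true, y_pred, labels=classes,
                                   average=None, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=classes))
print(classification_report(y_true, y_pred, labels=classes, zero_division=0))
```

Passing the same `labels=classes` argument everywhere keeps the confusion-matrix rows in a fixed class order, which is what allows per-category error patterns to be compared across the two models.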
If this is right
- Domain-specific fine-tuning produces higher diagnostic accuracy than direct application of general models.
- Open-source multimodal models can be adapted to exceed proprietary general models on specialized medical tasks.
- Elevated sensitivity for cancer and pneumonia supports earlier detection of critical conditions.
- Confusion-matrix analysis reveals category-specific error patterns that guide further model refinement.
- Reduced hallucinations through fine-tuning improves suitability for clinical decision support.
Where Pith is reading between the lines
- Healthcare systems might gain more by fine-tuning smaller open models on local data than by relying on large proprietary APIs.
- The performance gap could narrow if GPT-4 receives equivalent domain adaptation or prompt optimization, an experiment the paper leaves open.
- Results encourage creation of similar fine-tuned models for additional imaging modalities such as pathology slides or ultrasound.
- If confirmed on larger varied datasets, these models could serve as accessible diagnostic aids in resource-limited settings.
Load-bearing premise
The test dataset mirrors real clinical cases and the head-to-head comparison remains fair despite differences in original training data and the absence of prompt engineering for GPT-4.
What would settle it
A new evaluation on an independent multi-hospital image dataset, or a re-test of GPT-4 with optimized prompts, that shows GPT-4 matching or exceeding 80.37 percent accuracy and equal sensitivity on cancer and pneumonia.
Original abstract
Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares MedGemma-4b-it (fine-tuned via LoRA) against untuned GPT-4 on zero-shot classification of six medical diseases from images. It reports a mean test accuracy of 80.37% for the fine-tuned MedGemma versus 69.58% for GPT-4, with higher sensitivity for MedGemma on cancer and pneumonia detection, supported by confusion matrices and classification reports, and concludes that domain-specific fine-tuning is required to reduce hallucinations in clinical use.
Significance. If the empirical comparison can be shown to be fair and the test set representative, the result would supply concrete evidence that LoRA fine-tuning of a smaller domain-specific multimodal model can outperform a much larger general-purpose model on medical imaging tasks, informing choices between open-source fine-tuning and proprietary zero-shot inference in clinical AI.
major comments (2)
- [Abstract / Results] The headline accuracies (80.37% vs 69.58%) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., a paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.
- [Methods] The exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, and no membership-inference or data-leakage analysis is provided; without these controls, the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.
minor comments (2)
- [Abstract] The abstract describes the GPT-4 evaluation as 'zero-shot' while MedGemma is explicitly fine-tuned; a brief clarification of the differing evaluation regimes would improve readability.
- [Figures] Figure captions for the confusion matrices should state the exact number of test samples per class to allow readers to interpret the reported sensitivities.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We provide point-by-point responses to the major comments and describe the planned revisions to the manuscript.
Point-by-point responses
-
Referee: [Abstract / Results] The headline accuracies (80.37% vs 69.58%) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., a paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.
Authors: We agree that these supporting details are important for the credibility of our results. The revised manuscript will include the test-set size, the class balance and construction method, and statistical significance testing (e.g., bootstrap confidence intervals; see the sketch after these responses) in the Results section to provide verifiable quantitative support for the reported accuracies. revision: yes
-
Referee: [Methods] The exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, and no membership-inference or data-leakage analysis is provided; without these controls, the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.
Authors: We will add the exact prompt template and details of any prompt-engineering steps for GPT-4 to the Methods section. We argue that the performance gap is due to the LoRA fine-tuning of MedGemma, as the same test set and comparable prompting strategies were used for both models. A membership-inference or data-leakage analysis was not performed in the original work. revision: partial
- Not addressed in revision: membership-inference or data-leakage analysis.
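The bootstrap confidence interval the authors promise could look like the following paired bootstrap over the shared test set. This is a minimal sketch with placeholder data, since the actual per-item correctness vectors and the test-set size are not reported.

```python
# Paired bootstrap CI for the accuracy gap between two models scored on
# the same test items. Correctness vectors here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n = 600  # illustrative test-set size; the paper does not report one
medgemma_correct = rng.random(n) < 0.80  # stand-in for ~80% accuracy
gpt4_correct = rng.random(n) < 0.70      # stand-in for ~70% accuracy

diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)  # resample the same items for both models
    diffs.append(medgemma_correct[idx].mean() - gpt4_correct[idx].mean())

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the accuracy gap: [{low:.3f}, {high:.3f}]")
# An interval excluding 0 would support the gap at roughly p < 0.05.
```

Resampling the same indices for both models preserves the pairing, which matters because the two models' errors on a shared test set are not independent.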
Circularity Check
No circularity: results are direct empirical measurements
Full rationale
The manuscript reports measured test accuracies (80.37% for LoRA-tuned MedGemma-4b-it versus 69.58% for untuned GPT-4) obtained by running both models on a fixed test set of images for six disease classes. No equations, parameter-fitting steps, or self-referential definitions appear in the provided text; the headline numbers are not derived from any model-internal quantity that was itself fitted to the same data. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the comparison. The central claim therefore remains an independent empirical observation rather than a quantity forced by construction from the paper's own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank and scaling factors
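For readers unfamiliar with where these free parameters enter, here is a minimal sketch of a LoRA adapter configuration using Hugging Face's peft library. The rank, scaling factor, dropout, and target modules shown are illustrative assumptions; the paper's actual hyperparameters are not reported in the text above.

```python
# Hypothetical LoRA setup for MedGemma-4b-it; r and lora_alpha are the
# free parameters named in the ledger. All values here are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

base = AutoModelForImageTextToText.from_pretrained("google/medgemma-4b-it")

lora_cfg = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor; updates are scaled by alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of base weights
```

Because only the injected low-rank matrices are trained, the choice of r and alpha fixes both the adapter's capacity and the effective scale of the learned update.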
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] International Diabetes Federation, "IDF Diabetes Atlas," 2019. [Online]. Available: https://www.diabetesatlas.org
- [2] World Health Organization, "Breast Cancer Fact Sheet," 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/breast-cancer
- [3] World Health Organization, "Pneumonia Fact Sheet," 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/pneumonia
- [4] Alzheimer's Association, "2024 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr. 2024. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC11095490/
- [5] World Health Organization, "Cardiovascular Diseases Fact Sheet," 2019. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
- [6] K. Xie et al., "Global, regional, and national burden inequality of chronic kidney disease, 1990–2021: a systematic analysis for the global burden of disease study 2021," Front. Med., vol. 11, Feb. 2024. [Online]. Available: https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1501175/full
- [7] A. Reshad, V. Nino, and M. Valero, "Deep Learning-Based Detection of Arrhythmia Using ECG Signals – A Comprehensive Review," Vasc. Health Risk Manag., vol. 21, pp. 685–703, Aug. 2025, doi: 10.2147/VHRM.S508620.
- [8] K. Singhal et al., "Large Language Models Encode Clinical Knowledge," arXiv:2212.13138, Dec. 2022, doi: 10.48550/arXiv.2212.13138.
- [9] K. Singhal et al., "Towards Expert-Level Medical Question Answering with Large Language Models," arXiv:2305.09617, May 2023, doi: 10.48550/arXiv.2305.09617.
- [10] H. Takita et al., "A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians," Npj Digit. Med., vol. 8, no. 1, p. 175, Mar. 2025, doi: 10.1038/s41746-025-01543-z.
- [11] S. Abbas et al., "The role of artificial intelligence in personalized medicine and predictive diagnostics – a narrative review," Insights-J. Health Rehabil., vol. 3, no. 1, pp. 624–631, Feb. 2025, doi: 10.71000/k6cga886.
- [12] P. Hager et al., "Evaluation and mitigation of the limitations of large language models in clinical decision-making," Nat. Med., vol. 30, no. 9, pp. 2613–2622, Sept. 2024, doi: 10.1038/s41591-024-03097-1.
- [14] T. B. Brown et al., "Language models are few-shot learners," in Proc. 34th Int. Conf. Neural Inf. Process. Syst. (NIPS '20), Vancouver, BC, Canada, 2020, Art. no. 159, doi: 10.5555/3495724.3495883.
- [15] A. Radford et al., "Language Models are Unsupervised Multitask Learners," arXiv e-prints, 2019, Art. no. 1907.10522, doi: 10.48550/arXiv.1907.10522. [Online]. Available: https://arxiv.org/abs/1907.10522
- [16] D. M. Korngiebel and S. D. Mooney, "Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery," npj Digit. Med., vol. 4, no. 1, Art. no. 93, 2021, doi: 10.1038/s41746-021-00464-x.
- [17] E. Ullah et al., "Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review," Diagn. Pathol., vol. 19, no. 1, Art. no. 43, 2024, doi: 10.1186/s13000-024-01464-7.
- [18] A. Lecler, L. Duron, and P. Soyer, "Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT," Diagn. Interv. Imaging, vol. 104, no. 6, pp. 269–274, 2023, doi: 10.1016/j.diii.2023.02.003.
- [19] K. Singhal et al., "Large language models encode clinical knowledge," Nature, vol. 620, no. 7972, pp. 172–180, 2023, doi: 10.1038/s41586-023-06291-2.
- [20] J. Lee et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2019, doi: 10.1093/bioinformatics/btz682.
- [21] E. Alsentzer et al., "Publicly Available Clinical BERT Embeddings," arXiv e-prints, 2019, Art. no. 1904.03323, doi: 10.48550/arXiv.1904.03323. [Online]. Available: https://arxiv.org/abs/1904.03323
- [22] A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, et al., "MedGemma Technical Report," arXiv:2507.05201 [cs.AI], 2025. [Online]. Available: https://arxiv.org/abs/2507.05201
- [23] OpenAI et al., "GPT-4 Technical Report," arXiv:2303.08774, 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
- [24] N. Prentzas, A. Kakas, and C. S. Pattichis, "Explainable AI applications in the Medical Domain: a systematic review," arXiv e-prints, 2023, Art. no. 2308.05411, doi: 10.48550/arXiv.2308.05411. [Online]. Available: https://arxiv.org/abs/2308.05411
- [25] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Int. Conf. Learning Representations (ICLR), May 2021. [Online]. Available: https://openreview.net/pdf?id=YicbFdNTTy
- [26] D. Zhou et al., "DeepViT: Towards Deeper Vision Transformer," arXiv:2103.11886, Mar. 2021. [Online]. Available: https://arxiv.org/abs/2103.11886
- [27] Y. Xie, "New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models," European Journal of Pediatrics, vol. 184, no. 12, p. 809, Dec. 2025, doi: 10.1007/s00431-025-06602-x.
- [28] P. Tschandl, C. Rosendahl, and H. Kittler, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions," Sci. Data, vol. 5, no. 1, Art. no. 180161, 2018, doi: 10.1038/sdata.2018.161.
- [29] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, "Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults," Journal of Cognitive Neuroscience, vol. 19, no. 9, pp. 1498–1507, 2007, doi: 10.1162/jocn.2007.19.9.1498. [Online]. Available: ...
- [30] R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin, "A curated mammography data set for use in computer-aided detection and diagnosis research," Scientific Data, vol. 4, Art. no. 170177, 2017, doi: 10.1038/sdata.2017.177. [Online]. Available: https://doi.org/10.1038/sdata.2017.177
- [31] A. H. Khan and M. Hussain, "ECG Images Dataset of Cardiac Patients," Mendeley Data, V2, 2021, doi: 10.17632/gwbz3fsgp8.2. [Online]. Available: https://doi.org/10.17632/gwbz3fsgp8.2
- [32] D. Kermany, K. Zhang, and M. Goldbaum, "Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification," Mendeley Data, V2, 2018, doi: 10.17632/rscbjbr9sj.2.
- [33] N. M. Islam, "CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone," Kaggle, 2021. [Online]. Available: https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone. [Accessed: Oct. 22, 2025]
- [34] E. J. Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," arXiv e-prints, 2021, Art. no. 2106.09685, doi: 10.48550/arXiv.2106.09685. [Online]. Available: https://arxiv.org/abs/2106.09685
- [35] P. Zhou, X. Xie, Z. Lin, and S. Yan, "Towards Understanding Convergence and Generalization of AdamW," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 9, pp. 6486–6493, Sept. 2024, doi: 10.1109/TPAMI.2024.3382294.