pith. machine review for the scientific record.

arxiv: 2512.23304 · v1 · submitted 2025-12-29 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical imaging · disease classification · multimodal LLMs · fine-tuning · LoRA · MedGemma · GPT-4 · zero-shot learning

The pith

A fine-tuned open-source MedGemma model outperforms untuned GPT-4 at classifying six diseases from medical images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares a domain-adapted open-source multimodal model against a general proprietary one for zero-shot disease diagnosis from images. MedGemma-4b-it, after Low-Rank Adaptation fine-tuning, reaches 80.37 percent mean test accuracy while GPT-4 reaches 69.58 percent, with notably higher sensitivity on cancer and pneumonia cases. Evaluation uses confusion matrices and classification reports to break down performance across all six categories. The work argues that domain-specific adaptation reduces clinical errors such as hallucinations. Readers would care because it suggests open models can be made reliable for medical tasks through targeted tuning rather than scale alone.

Core claim

The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), achieves a mean test accuracy of 80.37 percent in classifying six diseases from medical images, outperforming the untuned GPT-4 at 69.58 percent. It also shows higher sensitivity in high-stakes tasks including cancer and pneumonia detection. Quantitative analysis through confusion matrices and classification reports supplies detailed performance insights across categories. These outcomes indicate that domain-specific fine-tuning is required to minimize hallucinations and support evidence-based medical reasoning.

What carries the argument

LoRA fine-tuning applied to the MedGemma-4b-it multimodal model, contrasted with direct use of untuned GPT-4, measured by accuracy, sensitivity, and per-class metrics on a fixed test set of images.
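As a concrete picture of the adaptation mechanism named above, here is a minimal NumPy sketch of the LoRA update: the frozen weight is augmented by a trainable low-rank product scaled by alpha/r. All dimensions, the rank, and the scaling factor below are illustrative stand-ins, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (d_out x d_in); stands in for one projection matrix.
d_out, d_in, r, alpha = 8, 8, 2, 4
W = rng.standard_normal((d_out, d_in))

# LoRA adds a trainable low-rank update: only A and B are learned.
A = rng.standard_normal((r, d_in)) * 0.01   # down-projection
B = np.zeros((d_out, r))                    # up-projection, zero-initialized

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without materializing it.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 at initialization, the adapted model matches the base model exactly,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized B is the standard LoRA trick: training begins at the base model and only the small matrices A and B (here 32 parameters versus 64 in W) are updated.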

If this is right

  • Domain-specific fine-tuning produces higher diagnostic accuracy than direct application of general models.
  • Open-source multimodal models can be adapted to exceed proprietary general models on specialized medical tasks.
  • Elevated sensitivity for cancer and pneumonia supports earlier detection of critical conditions.
  • Confusion-matrix analysis reveals category-specific error patterns that guide further model refinement.
  • Reduced hallucinations through fine-tuning improves suitability for clinical decision support.
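The confusion-matrix bullet can be made concrete with a short sketch of how per-class sensitivity and overall accuracy fall out of the matrix. The counts below are hypothetical, not drawn from the paper.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, cols = predicted);
# the counts are illustrative only.
cm = np.array([
    [50,  5,  5],
    [ 4, 40,  6],
    [ 2,  3, 45],
])

# Per-class sensitivity (recall): correct predictions over all true cases of the class.
sensitivity = np.diag(cm) / cm.sum(axis=1)

# Overall accuracy: trace (all correct) over total test cases.
accuracy = np.trace(cm) / cm.sum()
```

Reading sensitivities off the rows is what surfaces category-specific weaknesses (e.g. one disease class with many off-diagonal counts) that a single accuracy figure hides.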

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Healthcare systems might gain more by fine-tuning smaller open models on local data than by relying on large proprietary APIs.
  • The performance gap could narrow if GPT-4 receives equivalent domain adaptation or prompt optimization, an experiment the paper leaves open.
  • Results encourage creation of similar fine-tuned models for additional imaging modalities such as pathology slides or ultrasound.
  • If confirmed on larger varied datasets, these models could serve as accessible diagnostic aids in resource-limited settings.

Load-bearing premise

The test dataset mirrors real clinical cases and the head-to-head comparison remains fair despite differences in original training data and the absence of prompt engineering for GPT-4.

What would settle it

A new evaluation on an independent multi-hospital image dataset, or a re-test of GPT-4 with optimized prompts, that shows GPT-4 matching or exceeding 80.37 percent accuracy and equal sensitivity on cancer and pneumonia.

Figures

Figures reproduced from arXiv: 2512.23304 by Md. Sazzadul Islam Prottasha, Nabil Walid Rafi.

Figure 1
Figure 1. Flow chart of disease classification. The accompanying text notes that MedGemma and GPT-4 were assessed on six disease datasets (skin cancer, Alzheimer's disease, breast cancer, cardiovascular, pneumonia, and chronic kidney disease), partitioned into training and validation sets, with the MedGemma-4b-it model trained via LoRA on a 70% training subset. view at source ↗
Figure 2
Figure 2. Training accuracy of the MedGemma-4b-it model. After training, MedGemma and GPT-4 were benchmarked on an identical, unseen test set for each disease class; MedGemma was fine-tuned before testing, while GPT-4 was assessed solely through a standardized prompting technique. view at source ↗
read the original abstract

Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript compares MedGemma-4b-it (fine-tuned via LoRA) against untuned GPT-4 on zero-shot classification of six medical diseases from images. It reports a mean test accuracy of 80.37% for the fine-tuned MedGemma versus 69.58% for GPT-4, with higher sensitivity for MedGemma on cancer and pneumonia detection, supported by confusion matrices and classification reports, and concludes that domain-specific fine-tuning is required to reduce hallucinations in clinical use.

Significance. If the empirical comparison can be shown to be fair and the test set representative, the result would supply concrete evidence that LoRA fine-tuning of a smaller domain-specific multimodal model can outperform a much larger general-purpose model on medical imaging tasks, informing choices between open-source fine-tuning and proprietary zero-shot inference in clinical AI.

major comments (2)
  1. [Abstract / Results] Abstract and Results section: the headline accuracies (80.37 % vs 69.58 %) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.
  2. [Methods] Methods section: the exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, nor is any membership-inference or data-leakage analysis provided; without these controls the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.
minor comments (2)
  1. [Abstract] The abstract describes the GPT-4 evaluation as 'zero-shot' while MedGemma is explicitly fine-tuned; a brief clarification of the differing evaluation regimes would improve readability.
  2. [Figures] Figure captions for the confusion matrices should state the exact number of test samples per class to allow readers to interpret the reported sensitivities.
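The paired bootstrap the referee asks for could look roughly like the sketch below, which resamples per-image outcomes while preserving the pairing between models. The sample size and hit-rates are synthetic stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-image correctness for two models evaluated on the same test set.
n = 500
hits_a = rng.random(n) < 0.80   # model A correct/incorrect per test image
hits_b = rng.random(n) < 0.70   # model B on the same images

def bootstrap_ci(a, b, n_boot=2000, level=0.95):
    # Resample images with replacement; keeping the same indices for both
    # models preserves the pairing, which tightens the interval.
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

lo, hi = bootstrap_ci(hits_a, hits_b)
```

If the resulting interval for the accuracy difference excludes zero, the headline gap is unlikely to be a test-set artifact; reporting it alongside test-set size would address the referee's first major comment.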

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We provide point-by-point responses to the major comments and describe the planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: the headline accuracies (80.37 % vs 69.58 %) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.

    Authors: We agree that these supporting details are important for the credibility of our results. The revised manuscript will include the test-set size, information on class balance and construction method, and statistical significance testing (e.g., bootstrap confidence intervals) in the Results section to provide verifiable quantitative support for the reported accuracies. revision: yes

  2. Referee: [Methods] Methods section: the exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, nor is any membership-inference or data-leakage analysis provided; without these controls the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.

    Authors: We will add the exact prompt template and details of any prompt-engineering steps for GPT-4 to the Methods section. We argue that the performance gap is due to the LoRA fine-tuning of MedGemma, as the same test set and comparable prompting strategies were used for both models. A membership-inference or data-leakage analysis was not performed in the original work. revision: partial

standing simulated objections not resolved
  • Membership-inference or data-leakage analysis

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

full rationale

The manuscript reports measured test accuracies (80.37 % for LoRA-tuned MedGemma-4b-it versus 69.58 % for untuned GPT-4) obtained by running both models on a fixed test set of images for six disease classes. No equations, parameter-fitting steps, or self-referential definitions appear in the provided text; the headline numbers are not derived from any model-internal quantity that was itself fitted to the same data. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the comparison. The central claim therefore remains an independent empirical observation rather than a quantity forced by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on empirical test-set performance after LoRA fine-tuning; the main unstated elements are the training data distribution and hyperparameter choices during adaptation.

free parameters (1)
  • LoRA rank and scaling factors
    These control the extent of adaptation during fine-tuning and are chosen or optimized to achieve the reported accuracy.

pith-pipeline@v0.9.0 · 5483 in / 1213 out tokens · 52029 ms · 2026-05-16T19:43:40.640854+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    IDF Diabetes Atlas

    International Diabetes Federation, “IDF Diabetes Atlas.” 2019. [Online]. Available: https://www.diabetesatlas.org

  2. [2]

    Breast Cancer Fact Sheet

    World Health Organization, “Breast Cancer Fact Sheet.” 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/breast-cancer

  3. [3]

    Pneumonia Fact Sheet

    World Health Organization, “Pneumonia Fact Sheet.” 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/pneumonia

  4. [4]

    2024 Alzheimer's disease facts and figures,

    Alzheimer's Association, "2024 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr. 2024. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC11095490/

  5. [5]

    Cardiovascular Diseases Fact Sheet

    World Health Organization, “Cardiovascular Diseases Fact Sheet.” 2019. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

  6. [6]

    Global, regional, and national burden inequality of chronic kidney disease, 1990–2021: a systematic analysis for the global burden of disease study 2021,

    K. Xie et al., “Global, regional, and national burden inequality of chronic kidney disease, 1990–2021: a systematic analysis for the global burden of disease study 2021,” Front. Med., vol. 11, Feb. 2024. [Online]. Available: https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1501175/full

  7. [7]

    Deep Learning-Based Detection of Arrhythmia Using ECG Signals – A Comprehensive Review,

    A. Reshad, V. Nino, and M. Valero, “Deep Learning-Based Detection of Arrhythmia Using ECG Signals – A Comprehensive Review,” Vasc. Health Risk Manag., vol. 21, pp. 685–703, Aug. 2025, doi: 10.2147/VHRM.S508620

  8. [8]

    Large Language Models Encode Clinical Knowledge,

    K. Singhal et al., “Large Language Models Encode Clinical Knowledge,” Dec. 26, 2022, arXiv:2212.13138. doi: 10.48550/arXiv.2212.13138

  9. [9]

    Towards Expert-Level Medical Question Answering with Large Language Models,

    K. Singhal et al., “Towards Expert-Level Medical Question Answering with Large Language Models,” May 16, 2023, arXiv:2305.09617. doi: 10.48550/arXiv.2305.09617

  10. [10]

    A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians,

    H. Takita et al., “A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians,” Npj Digit. Med., vol. 8, no. 1, p. 175, Mar. 2025, doi: 10.1038/s41746-025-01543-z

  11. [11]

    THE ROLE OF ARTIFICIAL INTELLIGENCE IN PERSONALIZED MEDICINE AND PREDICTIVE DIAGNOSTICS – A NARRATIVE REVIEW,

    S. Abbas et al., “THE ROLE OF ARTIFICIAL INTELLIGENCE IN PERSONALIZED MEDICINE AND PREDICTIVE DIAGNOSTICS – A NARRATIVE REVIEW,” Insights-J. Health Rehabil., vol. 3, no. 1 (Health&Allied), pp. 624–631, Feb. 2025, doi: 10.71000/k6cga886

  12. [12]

    Evaluation and mitigation of the limitations of large language models in clinical decision-making,

    P. Hager et al., “Evaluation and mitigation of the limitations of large language models in clinical decision-making,” Nat. Med., vol. 30, no. 9, pp. 2613–2622, Sept. 2024, doi: 10.1038/s41591-024-03097-1

  13. [14]

    Language models are few-shot learners,

    T. B. Brown et al., “Language models are few-shot learners,” in Proc. 34th Int. Conf. Neural Inf. Process. Syst. (NIPS '20), Vancouver, BC, Canada, 2020, Art. no. 159, doi: 10.5555/3495724.3495883

  14. [15]

    Language Models are Unsupervised Multitask Learners

    A. Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019.

  15. [16]

    Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery,

    D. M. Korngiebel and S. D. Mooney, “Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery,” npj Digit. Med., vol. 4, no. 1, Art. no. 93, 2021, doi: 10.1038/s41746-021-00464-x

  16. [17]

    Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review,

    E. Ullah et al., “Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review,” Diagn Pathol, vol. 19, no. 1, Art. no. 43, 2024, doi: 10.1186/s13000-024-01464-7

  17. [18]

    Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT,

    A. Lecler, L. Duron, and P. Soyer, “Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT,” Diagn. Interv. Imaging, vol. 104, no. 6, pp. 269 –274, 2023, doi: 10.1016/j.diii.2023.02.003

  18. [19]

    Large language models encode clinical knowledge,

    K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, 2023, doi: 10.1038/s41586-023-06291-2

  19. [20]

    BioBERT: a pre-trained biomedical language representation model for biomedical text mining,

    J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2019, doi: 10.1093/bioinformatics/btz682

  20. [21]

    Publicly Available Clinical BERT Embeddings

    E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” arXiv e-prints, 2019, Art. no. 1904.03323, doi: 10.48550/arXiv.1904.03323. [Online]. Available: https://arxiv.org/abs/1904.03323

  21. [22]

    MedGemma Technical Report

    A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, et al., “MedGemma Technical Report,” arXiv e-prints, arXiv:2507.05201 [cs.AI], 2025. [Online]. Available: https://arxiv.org/abs/2507.05201

  22. [23]

    GPT-4 Technical Report

    OpenAI et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

  23. [24]

    Explainable AI applications in the Medical Domain: a systematic review,

    N. Prentzas, A. Kakas, and C. S. Pattichis, “Explainable AI applications in the Medical Domain: a systematic review,” arXiv e-prints, 2023, Art. no. 2308.05411, doi: 10.48550/arXiv.2308.05411. [Online]. Available: https://arxiv.org/abs/2308.05411

  24. [25]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

    A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” International Conference on Learning Representations, May 2021, [Online]. Available: https://openreview.net/pdf?id=YicbFdNTTy

  25. [26]

    DeepViT: Towards Deeper Vision Transformer,

    D. Zhou et al., “DeepViT: Towards Deeper Vision Transformer,” arXiv e-prints, Mar. 22, 2021. [Online]. Available: https://arxiv.org/abs/2103.11886

  26. [27]

    New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models,

    Y. Xie, “New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models,” European Journal of Pediatrics, vol. 184, no. 12, p. 809, Dec. 2025, doi: 10.1007/s00431-025-06602-x

  27. [28]

    The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,

    P. Tschandl, C. Rosendahl, and H. Kittler, “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Sci Data, vol. 5, no. 1, Art. no. 180161, 2018, doi: 10.1038/sdata.2018.161

  28. [29]

    Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,

    D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,” Journal of Cognitive Neuroscience, vol. 19, no. 9, pp. 1498–1507, 2007. doi: 10.1162/jocn.2007.19.9.1498. [Online]. Available: ...

  29. [30]

    A curated mammography data set for use in computer-aided detection and diagnosis research,

    R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin, “A curated mammography data set for use in computer-aided detection and diagnosis research,” Scientific Data, vol. 4, p. 170177, 2017. doi: 10.1038/sdata.2017.177. [Online]. Available: https://doi.org/10.1038/sdata.2017.177

  30. [31]

    ECG Images Dataset of Cardiac Patients,

    A. H. Khan and M. Hussain, “ECG Images Dataset of Cardiac Patients,” Mendeley Data, V2, 2021. doi: 10.17632/gwbz3fsgp8.2. [Online]. Available: https://doi.org/10.17632/gwbz3fsgp8.2

  31. [32]

    Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification,

    D. Kermany, K. Zhang, and M. Goldbaum, “Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification,” Mendeley Data, V2, 2018, doi: 10.17632/rscbjbr9sj.2

  32. [33]

    CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone,

    N. M. Islam, “CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone,” Kaggle, 2021. [Online]. Available: https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone. [Accessed: Oct. 22, 2025]

  33. [34]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv e-prints, 2021, Art. no. 2106.09685, doi: 10.48550/arXiv.2106.09685. [Online]. Available: https://arxiv.org/abs/2106.09685

  34. [35]

    Towards Understanding Convergence and Generalization of AdamW,

    P. Zhou, X. Xie, Z. Lin, and S. Yan, “Towards Understanding Convergence and Generalization of AdamW,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 9, pp. 6486–6493, Sept. 2024, doi: 10.1109/TPAMI.2024.3382294