pith. machine review for the scientific record.

arxiv: 2512.23304 · v1 · submitted 2025-12-29 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical imaging · disease classification · multimodal LLMs · fine-tuning · LoRA · MedGemma · GPT-4 · zero-shot learning

The pith

A fine-tuned open-source MedGemma model outperforms untuned GPT-4 at classifying six diseases from medical images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares a domain-adapted open-source multimodal model against a general proprietary one for zero-shot disease diagnosis from images. MedGemma-4b-it, after Low-Rank Adaptation fine-tuning, reaches 80.37 percent mean test accuracy while GPT-4 reaches 69.58 percent, with notably higher sensitivity on cancer and pneumonia cases. Evaluation uses confusion matrices and classification reports to break down performance across all six categories. The work argues that domain-specific adaptation reduces clinical errors such as hallucinations. Readers would care because it suggests open models can be made reliable for medical tasks through targeted tuning rather than scale alone.

Core claim

The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), achieves a mean test accuracy of 80.37 percent in classifying six diseases from medical images, outperforming the untuned GPT-4 at 69.58 percent. It also shows higher sensitivity in high-stakes tasks including cancer and pneumonia detection. Quantitative analysis through confusion matrices and classification reports supplies detailed performance insights across categories. These outcomes indicate that domain-specific fine-tuning is required to minimize hallucinations and support evidence-based medical reasoning.

What carries the argument

LoRA fine-tuning applied to the MedGemma-4b-it multimodal model, contrasted with direct use of untuned GPT-4, measured by accuracy, sensitivity, and per-class metrics on a fixed test set of images.
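As a concrete picture of the adaptation mechanism named above, here is a minimal NumPy sketch of the LoRA update: the frozen weight is augmented by a trainable low-rank product scaled by alpha/r. All dimensions, the rank, and the scaling factor below are illustrative stand-ins, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (d_out x d_in); stands in for one projection matrix.
d_out, d_in, r, alpha = 8, 8, 2, 4
W = rng.standard_normal((d_out, d_in))

# LoRA adds a trainable low-rank update: only A and B are learned.
A = rng.standard_normal((r, d_in)) * 0.01   # down-projection
B = np.zeros((d_out, r))                    # up-projection, zero-initialized

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without materializing it.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 at initialization, the adapted model matches the base model exactly,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized B is the standard LoRA trick: training begins at the base model and only the small matrices A and B (here 32 parameters versus 64 in W) are updated.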

If this is right

  • Domain-specific fine-tuning produces higher diagnostic accuracy than direct application of general models.
  • Open-source multimodal models can be adapted to exceed proprietary general models on specialized medical tasks.
  • Elevated sensitivity for cancer and pneumonia supports earlier detection of critical conditions.
  • Confusion-matrix analysis reveals category-specific error patterns that guide further model refinement.
  • Reduced hallucinations through fine-tuning improves suitability for clinical decision support.
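The confusion-matrix bullet can be made concrete with a short sketch of how per-class sensitivity and overall accuracy fall out of the matrix. The counts below are hypothetical, not drawn from the paper.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, cols = predicted);
# the counts are illustrative only.
cm = np.array([
    [50,  5,  5],
    [ 4, 40,  6],
    [ 2,  3, 45],
])

# Per-class sensitivity (recall): correct predictions over all true cases of the class.
sensitivity = np.diag(cm) / cm.sum(axis=1)

# Overall accuracy: trace (all correct) over total test cases.
accuracy = np.trace(cm) / cm.sum()
```

Reading sensitivities off the rows is what surfaces category-specific weaknesses (e.g. one disease class with many off-diagonal counts) that a single accuracy figure hides.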

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Healthcare systems might gain more by fine-tuning smaller open models on local data than by relying on large proprietary APIs.
  • The performance gap could narrow if GPT-4 receives equivalent domain adaptation or prompt optimization, an experiment the paper leaves open.
  • Results encourage creation of similar fine-tuned models for additional imaging modalities such as pathology slides or ultrasound.
  • If confirmed on larger varied datasets, these models could serve as accessible diagnostic aids in resource-limited settings.

Load-bearing premise

The test dataset mirrors real clinical cases and the head-to-head comparison remains fair despite differences in original training data and the absence of prompt engineering for GPT-4.

What would settle it

A new evaluation on an independent multi-hospital image dataset, or a re-test of GPT-4 with optimized prompts, that shows GPT-4 matching or exceeding 80.37 percent accuracy and equal sensitivity on cancer and pneumonia.

Figures

Figures reproduced from arXiv: 2512.23304 by Md. Sazzadul Islam Prottasha, Nabil Walid Rafi.

Figure 1
Figure 1. Flow chart of disease classification. The accompanying text notes that MedGemma and GPT-4 were assessed on six disease datasets (skin cancer, Alzheimer's disease, breast cancer, cardiovascular, pneumonia, and chronic kidney disease), partitioned into training and validation sets, with the MedGemma-4b-it model trained via LoRA on a 70% training subset. view at source ↗
Figure 2
Figure 2. Training accuracy of the MedGemma-4b-it model. After training, MedGemma and GPT-4 were benchmarked on an identical, unseen test set for each disease class; MedGemma was fine-tuned before testing, while GPT-4 was assessed solely through a standardized prompting technique. view at source ↗
read the original abstract

Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript compares MedGemma-4b-it (fine-tuned via LoRA) against untuned GPT-4 on zero-shot classification of six medical diseases from images. It reports a mean test accuracy of 80.37% for the fine-tuned MedGemma versus 69.58% for GPT-4, with higher sensitivity for MedGemma on cancer and pneumonia detection, supported by confusion matrices and classification reports, and concludes that domain-specific fine-tuning is required to reduce hallucinations in clinical use.

Significance. If the empirical comparison can be shown to be fair and the test set representative, the result would supply concrete evidence that LoRA fine-tuning of a smaller domain-specific multimodal model can outperform a much larger general-purpose model on medical imaging tasks, informing choices between open-source fine-tuning and proprietary zero-shot inference in clinical AI.

major comments (2)
  1. [Abstract / Results] Abstract and Results section: the headline accuracies (80.37 % vs 69.58 %) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.
  2. [Methods] Methods section: the exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, nor is any membership-inference or data-leakage analysis provided; without these controls the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.
minor comments (2)
  1. [Abstract] The abstract describes the GPT-4 evaluation as 'zero-shot' while MedGemma is explicitly fine-tuned; a brief clarification of the differing evaluation regimes would improve readability.
  2. [Figures] Figure captions for the confusion matrices should state the exact number of test samples per class to allow readers to interpret the reported sensitivities.
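The paired bootstrap the referee asks for could look roughly like the sketch below, which resamples per-image outcomes while preserving the pairing between models. The sample size and hit-rates are synthetic stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-image correctness for two models evaluated on the same test set.
n = 500
hits_a = rng.random(n) < 0.80   # model A correct/incorrect per test image
hits_b = rng.random(n) < 0.70   # model B on the same images

def bootstrap_ci(a, b, n_boot=2000, level=0.95):
    # Resample images with replacement; keeping the same indices for both
    # models preserves the pairing, which tightens the interval.
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

lo, hi = bootstrap_ci(hits_a, hits_b)
```

If the resulting interval for the accuracy difference excludes zero, the headline gap is unlikely to be a test-set artifact; reporting it alongside test-set size would address the referee's first major comment.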

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We provide point-by-point responses to the major comments and describe the planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results section: the headline accuracies (80.37 % vs 69.58 %) are presented without any report of test-set size, class balance, construction method, or statistical significance testing (e.g., paired test or bootstrap CI), leaving the central performance claim without verifiable quantitative support.

    Authors: We agree that these supporting details are important for the credibility of our results. The revised manuscript will include the test-set size, information on class balance and construction method, and statistical significance testing (e.g., bootstrap confidence intervals) in the Results section to provide verifiable quantitative support for the reported accuracies. revision: yes

  2. Referee: [Methods] Methods section: the exact prompt template and any prompt-engineering steps applied to the untuned GPT-4 are not documented, nor is any membership-inference or data-leakage analysis provided; without these controls the attribution of the 10.8-point gap to LoRA fine-tuning rather than prompt disparity or training-data overlap cannot be secured.

    Authors: We will add the exact prompt template and details of any prompt-engineering steps for GPT-4 to the Methods section. We argue that the performance gap is due to the LoRA fine-tuning of MedGemma, as the same test set and comparable prompting strategies were used for both models. A membership-inference or data-leakage analysis was not performed in the original work. revision: partial

standing simulated objections not resolved
  • Membership-inference or data-leakage analysis

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

full rationale

The manuscript reports measured test accuracies (80.37 % for LoRA-tuned MedGemma-4b-it versus 69.58 % for untuned GPT-4) obtained by running both models on a fixed test set of images for six disease classes. No equations, parameter-fitting steps, or self-referential definitions appear in the provided text; the headline numbers are not derived from any model-internal quantity that was itself fitted to the same data. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the comparison. The central claim therefore remains an independent empirical observation rather than a quantity forced by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on empirical test-set performance after LoRA fine-tuning; the main unstated elements are the training data distribution and hyperparameter choices during adaptation.

free parameters (1)
  • LoRA rank and scaling factors
    These control the extent of adaptation during fine-tuning and are chosen or optimized to achieve the reported accuracy.

pith-pipeline@v0.9.0 · 5483 in / 1213 out tokens · 52029 ms · 2026-05-16T19:43:40.640854+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    IDF Diabetes Atlas

    International Diabetes Federation, “IDF Diabetes Atlas.” 2019. [Online]. Available: https://www.diabetesatlas.org

  2. [2]

    Breast Cancer Fact Sheet

    World Health Organization, “Breast Cancer Fact Sheet.” 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/breast-cancer

  3. [3]

    Pneumonia Fact Sheet

    World Health Organization, “Pneumonia Fact Sheet.” 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/pneumonia

  4. [4]

    2024 Alzheimer's disease facts and figures,

    Alzheimer's Association, "2024 Alzheimer's disease facts and figures," Alzheimer's & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr. 2024. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC11095490/

  5. [5]

    Cardiovascular Diseases Fact Sheet

    World Health Organization, “Cardiovascular Diseases Fact Sheet.” 2019. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

  6. [6]

    Global, regional, and national burden inequality of chronic kidney disease, 1990–2021: a systematic analysis for the global burden of disease study 2021,

    K. Xie et al., “Global, regional, and national burden inequality of chronic kidney disease, 1990–2021: a systematic analysis for the global burden of disease study 2021,” Front. Med., vol. 11, Feb. 2024. [Online]. Available: https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2024.1501175/full

  7. [7]

    Deep Learning-Based Detection of Arrhythmia Using ECG Signals – A Comprehensive Review,

    A. Reshad, V. Nino, and M. Valero, “Deep Learning-Based Detection of Arrhythmia Using ECG Signals – A Comprehensive Review,” Vasc. Health Risk Manag., vol. 21, pp. 685–703, Aug. 2025, doi: 10.2147/VHRM.S508620

  8. [8]

    Large Language Models Encode Clinical Knowledge,

    K. Singhal et al., “Large Language Models Encode Clinical Knowledge,” Dec. 26, 2022, arXiv:2212.13138. doi: 10.48550/arXiv.2212.13138

  9. [9]

    Towards Expert-Level Medical Question Answering with Large Language Models,

    K. Singhal et al., “Towards Expert-Level Medical Question Answering with Large Language Models,” May 16, 2023, arXiv:2305.09617. doi: 10.48550/arXiv.2305.09617

  10. [10]

    A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians,

    H. Takita et al., “A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians,” Npj Digit. Med., vol. 8, no. 1, p. 175, Mar. 2025, doi: 10.1038/s41746-025-01543-z

  11. [11]

    THE ROLE OF ARTIFICIAL INTELLIGENCE IN PERSONALIZED MEDICINE AND PREDICTIVE DIAGNOSTICS – A NARRATIVE REVIEW,

    S. Abbas et al., “THE ROLE OF ARTIFICIAL INTELLIGENCE IN PERSONALIZED MEDICINE AND PREDICTIVE DIAGNOSTICS – A NARRATIVE REVIEW,” Insights-J. Health Rehabil., vol. 3, no. 1 (Health&Allied), pp. 624–631, Feb. 2025, doi: 10.71000/k6cga886

  12. [12]

    Evaluation and mitigation of the limitations of large language models in clinical decision-making,

    P. Hager et al., “Evaluation and mitigation of the limitations of large language models in clinical decision-making,” Nat. Med., vol. 30, no. 9, pp. 2613–2622, Sept. 2024, doi: 10.1038/s41591-024-03097-1

  13. [14]

    Language models are few-shot learners,

    T. B. Brown et al., “Language models are few-shot learners,” in Proc. 34th Int. Conf. Neural Inf. Process. Syst. (NIPS '20), Vancouver, BC, Canada, 2020, Art. no. 159, doi: 10.5555/3495724.3495883

  14. [15]

    Language Models are Unsupervised Multitask Learners

    A. Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019.

  15. [16]

    Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery,

    D. M. Korngiebel and S. D. Mooney, “Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery,” npj Digit. Med., vol. 4, no. 1, Art. no. 93, 2021, doi: 10.1038/s41746-021-00464-x

  16. [17]

    Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review,

    E. Ullah et al., “Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review,” Diagn Pathol, vol. 19, no. 1, Art. no. 43, 2024, doi: 10.1186/s13000-024-01464-7

  17. [18]

    Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT,

    A. Lecler, L. Duron, and P. Soyer, “Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT,” Diagn. Interv. Imaging, vol. 104, no. 6, pp. 269 –274, 2023, doi: 10.1016/j.diii.2023.02.003

  18. [19]

    Large language models encode clinical knowledge,

    K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, 2023, doi: 10.1038/s41586-023-06291-2

  19. [20]

    BioBERT: a pre-trained biomedical language representation model for biomedical text mining,

    J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2019, doi: 10.1093/bioinformatics/btz682

  20. [21]

    Publicly Available Clinical BERT Embeddings

    E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” arXiv e-prints, 2019, Art. no. 1904.03323, doi: 10.48550/arXiv.1904.03323. [Online]. Available: https://arxiv.org/abs/1904.03323

  21. [22]

    MedGemma Technical Report

    A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, et al., “MedGemma Technical Report,” arXiv e-prints, arXiv:2507.05201 [cs.AI], 2025. [Online]. Available: https://arxiv.org/abs/2507.05201

  22. [23]

    GPT-4 Technical Report

    OpenAI et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

  23. [24]

    Explainable AI applications in the Medical Domain: a systematic review,

    N. Prentzas, A. Kakas, and C. S. Pattichis, “Explainable AI applications in the Medical Domain: a systematic review,” arXiv e-prints, 2023, Art. no. 2308.05411, doi: 10.48550/arXiv.2308.05411. [Online]. Available: https://arxiv.org/abs/2308.05411

  24. [25]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

    A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” International Conference on Learning Representations, May 2021, [Online]. Available: https://openreview.net/pdf?id=YicbFdNTTy

  25. [26]

    DeepViT: Towards Deeper Vision Transformer,

    D. Zhou et al., “DeepViT: Towards Deeper Vision Transformer,” arXiv e-prints, Mar. 22, 2021. [Online]. Available: https://arxiv.org/abs/2103.11886

  26. [27]

    New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models,

    Y. Xie, “New chapter in pediatric medicine: technological evolution, application, and evaluation system of large language models,” European Journal of Pediatrics, vol. 184, no. 12, p. 809, Dec. 2025, doi: 10.1007/s00431-025-06602-x

  27. [28]

    The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,

    P. Tschandl, C. Rosendahl, and H. Kittler, “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Sci Data, vol. 5, no. 1, Art. no. 180161, 2018, doi: 10.1038/sdata.2018.161

  28. [29]

    Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,

    D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,” Journal of Cognitive Neuroscience, vol. 19, no. 9, pp. 1498–1507, 2007. doi: 10.1162/jocn.2007.19.9.1498. [Online]. Available: ...

  29. [30]

    A curated mammography data set for use in computer-aided detection and diagnosis research,

    R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin, “A curated mammography data set for use in computer-aided detection and diagnosis research,” Scientific Data, vol. 4, p. 170177, 2017. doi: 10.1038/sdata.2017.177. [Online]. Available: https://doi.org/10.1038/sdata.2017.177

  30. [31]

    ECG Images Dataset of Cardiac Patients,

    A. H. Khan and M. Hussain, “ECG Images Dataset of Cardiac Patients,” Mendeley Data, V2, 2021. doi: 10.17632/gwbz3fsgp8.2. [Online]. Available: https://doi.org/10.17632/gwbz3fsgp8.2

  31. [32]

    Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification,

    D. Kermany, K. Zhang, and M. Goldbaum, “Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification,” Mendeley Data, V2, 2018, doi: 10.17632/rscbjbr9sj.2

  32. [33]

    CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone,

    N. M. Islam, “CT KIDNEY DATASET: Normal-Cyst-Tumor and Stone,” Kaggle, 2021. [Online]. Available: https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone. [Accessed: Oct. 22, 2025]

  33. [34]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv e-prints, 2021, Art. no. 2106.09685, doi: 10.48550/arXiv.2106.09685. [Online]. Available: https://arxiv.org/abs/2106.09685

  34. [35]

    Towards Understanding Convergence and Generalization of AdamW,

    P. Zhou, X. Xie, Z. Lin, and S. Yan, “Towards Understanding Convergence and Generalization of AdamW,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 9, pp. 6486–6493, Sept. 2024, doi: 10.1109/TPAMI.2024.3382294