pith. machine review for the scientific record.

arxiv: 2604.05081 · v2 · submitted 2026-04-06 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

MedGemma 1.5 Technical Report

Pith reviewed 2026-05-10 19:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords MedGemma 1.5 · multimodal medical AI · 3D CT MRI imaging · whole slide pathology · clinical document understanding · longitudinal X-ray analysis · medical vision language models

The pith

MedGemma 1.5 4B adds 3D CT/MRI volumes and whole-slide pathology to a single model, with an 11% absolute accuracy gain on 3D MRI condition classification and a 47% macro F1 gain on whole-slide pathology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedGemma 1.5 4B as an extension of MedGemma 1 that incorporates high-dimensional medical imaging, including CT and MRI volumes plus histopathology whole slide images. It achieves this by adding new training data along with long-context 3D volume slicing and whole-slide sampling strategies inside one architecture. The resulting model delivers absolute gains of 11% on 3D MRI condition classification, 3% on 3D CT, and 47% macro F1 on whole slide pathology imaging. It also improves anatomical localization on chest X-rays by 35% IoU, longitudinal chest X-ray analysis by 4% macro accuracy, and text-based tasks, with 5% higher MedQA accuracy and 22% higher EHRQA accuracy. These changes position the model as an enhanced open foundation for building medical AI systems that handle richer clinical data types.
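
Because the gains are stated as absolute (percentage-point) improvements, their practical weight depends on the baseline, which this summary does not give. The sketch below uses purely hypothetical baselines (assumptions, not figures from the paper) to show how the same 11-point absolute gain can correspond to quite different relative gains.

```python
def relative_gain(baseline: float, absolute_gain: float) -> float:
    """Relative improvement implied by an absolute gain over a baseline.

    Both arguments are fractions in [0, 1]; e.g. 0.11 means 11 percentage points.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return absolute_gain / baseline


# Hypothetical baselines, chosen only for illustration (not from the paper).
for baseline in (0.40, 0.80):
    new_score = baseline + 0.11
    rel = relative_gain(baseline, 0.11)
    print(f"baseline={baseline:.2f} new={new_score:.2f} relative gain={rel:.1%}")
```

An 11-point gain over a low baseline is a much larger relative change than the same gain over a high one, which is why absolute scores for both models matter when reading the deltas.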

Core claim

MedGemma 1.5 4B integrates capabilities for 3D volumetric imaging and whole-slide pathology through new training data, long-context 3D volume slicing, and whole-slide pathology sampling, producing absolute accuracy gains of 11% on 3D MRI condition classification, 3% on 3D CT classification, 47% macro F1 on pathology imaging, 35% IoU on anatomical localization, and 4% macro accuracy on multi-timepoint chest X-ray analysis, while also raising text-based clinical knowledge by 5% on MedQA and 22% on EHRQA.
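
For readers less familiar with the localization metric named above, here is a minimal sketch of Intersection over Union for two axis-aligned bounding boxes. The `(x_min, y_min, x_max, y_max)` box convention and the function name are assumptions made for illustration; the paper's exact evaluation code is not shown on this page.

```python
from typing import Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes do not overlap.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and disjoint boxes score 0.0, so a 35-point IoU gain describes a substantial shift in how well predicted boxes overlap the reference boxes.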

What carries the argument

Long-context 3D volume slicing and whole-slide pathology sampling strategies that process high-dimensional inputs within a single unified architecture.
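
This page does not describe how the slicing works. As a rough intuition only, one simple way to fit a 3D volume into a fixed context budget is to pick evenly spaced 2D slices along the depth axis. The sketch below assumes that uniform strategy; `sample_slice_indices` and `max_slices` are hypothetical names, and the paper's actual procedure may differ.

```python
from typing import List


def sample_slice_indices(depth: int, max_slices: int) -> List[int]:
    """Pick up to max_slices evenly spaced slice indices from a volume.

    Assumption: uniform subsampling along the depth axis. This is an
    illustrative stand-in, not the method described in the paper.
    """
    if depth <= 0 or max_slices <= 0:
        raise ValueError("depth and max_slices must be positive")
    if depth <= max_slices:
        # The whole volume fits in the budget: keep every slice.
        return list(range(depth))
    if max_slices == 1:
        # With a single slot, take the middle slice.
        return [depth // 2]
    # Spread indices from the first slice (0) to the last (depth - 1).
    step = (depth - 1) / (max_slices - 1)
    return [round(i * step) for i in range(max_slices)]


print(sample_slice_indices(10, 4))  # prints [0, 3, 6, 9]
```

The trade-off in any such scheme is between context length and the chance of skipping a slice that contains the finding, which is one reason the slicing details matter for interpreting the reported gains.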

Load-bearing premise

The performance gains arise from the new data and specialized slicing methods rather than overfitting to the chosen evaluation sets.

What would settle it

Running MedGemma 1.5 4B on an independent collection of 3D MRI volumes collected after the model's training cutoff and observing no accuracy improvement over MedGemma 1 would falsify the generalization of the reported gains.
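
One way such a check could be scored is with a paired bootstrap over per-case correctness of the two models on the same independent cases. The sketch below is an assumption about how a reader might run the comparison, not a procedure from the paper; the function name and defaults are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    correct_new: List[bool],
    correct_old: List[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy delta, 2.5th percentile, 97.5th percentile)."""
    if len(correct_new) != len(correct_old) or not correct_new:
        raise ValueError("need two non-empty, equally long result lists")
    n = len(correct_new)
    # Per-case difference: +1 if only the new model is right, -1 if only the old one is.
    diffs = [int(a) - int(b) for a, b in zip(correct_new, correct_old)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    # Resample cases with replacement and recompute the delta each time.
    deltas = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    return observed, lo, hi
```

If the resulting interval comfortably includes zero on data collected after the training cutoff, that would be the "no accuracy improvement" outcome described above.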

read the original abstract

We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Compared to MedGemma 1 4B, MedGemma 1.5 4B demonstrates significant gains in these new areas, improving 3D MRI condition classification accuracy by 11% and 3D CT condition classification by 3% (absolute improvements). In whole slide pathology imaging, MedGemma 1.5 4B achieves a 47% macro F1 gain. Additionally, it improves anatomical localization with a 35% increase in Intersection over Union on chest X-rays and achieves a 4% macro accuracy for longitudinal (multi-timepoint) chest x-ray analysis. Beyond its improved multimodal performance over MedGemma 1, MedGemma 1.5 improves on text-based clinical knowledge and reasoning, improving by 5% on MedQA accuracy and 22% on EHRQA accuracy. It also achieves an average of 18% macro F1 on 4 different lab report information extraction datasets (EHR Datasets 2, 3, 4, and Mendeley Clinical Laboratory Test Reports). Taken together, MedGemma 1.5 serves as a robust, open resource for the community, designed as an improved foundation on which developers can create the next generation of medical AI systems. Resources and tutorials for building upon MedGemma 1.5 can be found at https://goo.gle/medgemma.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces MedGemma 1.5 4B as an extension of MedGemma 1, adding support for 3D CT/MRI volumes, histopathology whole-slide images, anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding. It describes new training data, long-context 3D slicing, and whole-slide sampling strategies, and claims absolute gains over the prior 4B model including +11% 3D MRI accuracy, +3% 3D CT accuracy, +47% macro F1 on pathology, +35% IoU on localization, +4% longitudinal CXR accuracy, +5% MedQA, +22% EHRQA, and 18% average macro F1 on lab-report extraction tasks.

Significance. If the gains prove robust, the work supplies an open multimodal medical foundation model with expanded 3D and pathology capabilities that could serve as a useful base for downstream medical AI systems. The explicit release of resources and tutorials is a positive contribution to reproducibility and community use.

major comments (3)
  1. [Abstract] Abstract: All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.
  2. [Abstract] Abstract: No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.
  3. [Abstract] Abstract: The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.
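
The disjointness check requested in major comment 2 can be as simple as intersecting identifiers across splits. The sketch below assumes each volume, slide, or report carries a patient-level identifier; the IDs shown are hypothetical and the paper's actual partitioning scheme is not given on this page.

```python
from typing import Iterable, Set


def overlapping_ids(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Return identifiers that appear in both splits (empty set means disjoint)."""
    return set(train_ids) & set(test_ids)


# Hypothetical patient identifiers, for illustration only.
train = ["p001", "p002", "p003"]
test = ["p003", "p104"]
leaked = overlapping_ids(train, test)
print(sorted(leaked))  # prints ['p003']
```

Checking at the patient level rather than the image level matters for longitudinal data, since two scans of the same patient in different splits would pass an image-level check while still leaking information.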
minor comments (1)
  1. [Abstract] Abstract: The phrase “an average of 18% macro F1 on 4 different lab report information extraction datasets” is ambiguous; it is unclear whether this is a macro-average across datasets or an average of per-dataset macro F1 scores, and no per-dataset numbers are supplied.
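
The ambiguity in the minor comment is not cosmetic: averaging per-dataset macro F1 scores and computing macro F1 over the pooled examples can give different numbers. The sketch below shows this with two toy datasets whose labels are made up for illustration; nothing here is taken from the paper's results.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty, equally long label sequences")
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two toy datasets with made-up labels (illustration only, not paper data).
ds_a = (list("xxyy"), list("xxyy"))  # perfect predictions
ds_b = (list("xy"), list("yx"))      # every prediction wrong

mean_of_macro = (macro_f1(*ds_a) + macro_f1(*ds_b)) / 2
pooled_macro = macro_f1(ds_a[0] + ds_b[0], ds_a[1] + ds_b[1])
print(mean_of_macro, pooled_macro)
```

Here the mean of per-dataset scores is 0.5 while the pooled score is about 0.67, so reporting per-dataset numbers alongside the average would remove the ambiguity.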

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving clarity and transparency in the manuscript. We agree that the abstract requires strengthening to better support the claims of improvement. We will make targeted revisions to the abstract and, where appropriate, the main text to address each point. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.

    Authors: We agree that absolute metrics are necessary for readers to assess the practical significance of the reported gains. While the full manuscript contains tables reporting absolute accuracy, F1, and IoU values for both MedGemma 1 and MedGemma 1.5 on the shared test sets, the abstract relies solely on deltas. We will revise the abstract to include key absolute numbers (e.g., the baseline and new 3D MRI accuracy, pathology macro F1, and localization IoU) alongside the deltas, ensuring the improvements can be evaluated in context. revision: yes

  2. Referee: [Abstract] Abstract: No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.

    Authors: We recognize the critical need for explicit details on evaluation protocols to demonstrate generalization. The full manuscript describes the datasets, sources, and train/test splits for the 3D imaging, pathology, and clinical QA tasks, and states that test data were held out. We will add concise statements to the abstract confirming the use of disjoint test sets, the evaluation datasets, and any statistical significance checks performed. A dedicated paragraph on data partitioning and leakage prevention will also be added or expanded in the methods section. revision: yes

  3. Referee: [Abstract] Abstract: The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.

    Authors: We agree that the abstract should briefly convey the technical distinctions of our slicing and sampling approaches. The manuscript details the long-context 3D volume slicing strategy and whole-slide patch sampling in the methods, including how slices are aggregated at inference and how patch-level predictions are combined for slide-level outputs. We will revise the abstract to include a short description of these procedures and their differences from standard fixed-context or random-patch baselines. Expanded pseudocode and ablation results on aggregation choices will be added to the main text or supplementary material. revision: yes
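
To make the aggregation question in point 3 concrete, the sketch below shows one common baseline for turning patch-level predictions into a slide-level output: averaging class probabilities across patches and taking the top class. This mean-pooling rule is an assumption chosen for illustration; the page does not confirm which aggregation the authors use, and the class names are hypothetical.

```python
from typing import Dict, List, Tuple


def aggregate_patch_probs(
    patch_probs: List[Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Average class probabilities over patches; return (top class, mean probabilities).

    Assumption: simple mean pooling, used here only as an illustrative baseline.
    """
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    totals: Dict[str, float] = {}
    for probs in patch_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    means = {label: total / len(patch_probs) for label, total in totals.items()}
    top = max(means, key=lambda label: means[label])
    return top, means


# Hypothetical patch-level outputs for one slide.
patches = [
    {"tumor": 0.9, "normal": 0.1},
    {"tumor": 0.4, "normal": 0.6},
    {"tumor": 0.7, "normal": 0.3},
]
label, means = aggregate_patch_probs(patches)
print(label)  # prints tumor
```

Alternatives such as majority voting or max pooling can change the slide-level answer on the same patches, which is why an ablation over aggregation choices would help attribute the reported gains.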

Circularity Check

0 steps flagged

No significant circularity in claimed improvements

full rationale

The paper is an empirical technical report describing new training data, 3D slicing strategies, and whole-slide sampling to extend MedGemma 1 into additional medical imaging modalities. Reported gains are presented as measured outcomes on evaluation tasks rather than quantities derived by construction from the model inputs or prior results. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to tautology appear in the text. External benchmarks and prior model comparisons supply independent content, keeping the derivation chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on the effectiveness of newly curated medical imaging and document datasets plus custom sampling strategies for 3D volumes and whole-slide images; these are not derived from first principles but chosen to enable the reported gains.

free parameters (2)
  • model size (4B parameters)
    Chosen architecture scale that the performance claims are tied to.
  • training data mixture and sampling rates for 3D and pathology
    Ad-hoc choices required to integrate the new modalities.
axioms (1)
  • domain assumption: Standard supervised fine-tuning and evaluation on held-out medical benchmarks will reflect real-world utility.
    Invoked when translating benchmark deltas to claims of improved foundation for medical AI systems.

pith-pipeline@v0.9.0 · 5843 in / 1395 out tokens · 59220 ms · 2026-05-10T19:31:31.829180+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Models Lack Temporal Awareness of Medical Knowledge

    cs.LG 2026-05 unverdicted novelty 8.0

    LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

  2. EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records

    cs.IR 2026-05 unverdicted novelty 5.0

    EHR-RAGp is a retrieval-augmented EHR foundation model that employs prototype-guided retrieval to dynamically integrate relevant historical patient context, outperforming prior models on clinical prediction tasks.

  3. CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
