arxiv: 2604.05081 · v2 · submitted 2026-04-06 · 💻 cs.AI

MedGemma 1.5 Technical Report

Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad,


Lin Yang, Charles Lau, Liron Yatziv, Tiffany Chen, Bram Sterling, Kenneth Philbrick, Richa Tiwari, Yun Liu, Madhuram Jajoo, Chandrashekar Sankarapu, Swapnil Vispute, Harshad Purandare, Abhishek Bijay Mishra, Sam Schmidgall, Tao Tu, Anil Palepu, Chunjong Park, Tim Strother, Rahul Thapa, Yong Cheng, Preeti Singh, Kat Black, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Joelle Barral, Tris Warkentin, Shravya Shetty, Dale Webster, Sunny Virmani, David F. Steiner, Can Kirmizibayrak, Daniel Golden


Pith reviewed 2026-05-10 19:31 UTC · model grok-4.3


keywords: MedGemma 1.5 · multimodal medical AI · 3D CT MRI imaging · whole slide pathology · clinical document understanding · longitudinal X-ray analysis · medical vision language models


The pith

MedGemma 1.5 4B adds 3D CT/MRI volumes and whole-slide pathology to a single model, with an 11% absolute accuracy gain on 3D MRI condition classification and a 47% macro F1 gain on whole-slide pathology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedGemma 1.5 4B as an extension of MedGemma 1 that incorporates high-dimensional medical imaging, including CT and MRI volumes plus histopathology whole slide images. It achieves this by adding new training data along with long-context 3D volume slicing and whole-slide sampling strategies inside one architecture. The resulting model delivers absolute gains of 11% on 3D MRI condition classification, 3% on 3D CT, and 47% macro F1 on whole slide pathology imaging. It also improves anatomical localization on chest X-rays by 35% IoU, longitudinal chest X-ray analysis by 4% macro accuracy, and text-based tasks, with 5% higher MedQA accuracy. These changes position the model as an enhanced open foundation for building medical AI systems that handle richer clinical data types.
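To make the whole-slide sampling idea concrete, here is a minimal sketch that tiles a slide into non-overlapping patches and draws a random subset of them. This page does not describe the paper's actual sampling procedure, so everything here is an assumption for illustration: the function name `sample_patch_origins`, the 224-pixel patch size, the default of 64 patches, and the choice of uniform random sampling.

```python
import random
from typing import List, Tuple


def sample_patch_origins(
    slide_width: int,
    slide_height: int,
    patch_size: int = 224,   # hypothetical patch size, not from the paper
    num_patches: int = 64,   # hypothetical sampling budget, not from the paper
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Return top-left (x, y) origins of randomly chosen grid patches."""
    if patch_size <= 0 or num_patches < 0:
        raise ValueError("patch_size must be positive and num_patches non-negative")
    rng = random.Random(seed)
    # Enumerate every non-overlapping patch that fits fully inside the slide.
    grid = [
        (x, y)
        for y in range(0, slide_height - patch_size + 1, patch_size)
        for x in range(0, slide_width - patch_size + 1, patch_size)
    ]
    # Sample without replacement; fall back to the full grid for small slides.
    return rng.sample(grid, min(num_patches, len(grid)))


if __name__ == "__main__":
    origins = sample_patch_origins(2240, 1120, patch_size=224, num_patches=10, seed=1)
    print(len(origins), "patches sampled")
```

A real pipeline would likely filter out background tiles before sampling; that step is omitted here because the page gives no detail on it.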

Core claim

MedGemma 1.5 4B integrates capabilities for 3D volumetric imaging and whole-slide pathology through new training data, long-context 3D volume slicing, and whole-slide pathology sampling, producing absolute accuracy gains of 11% on 3D MRI condition classification, 3% on 3D CT classification, 47% macro F1 on pathology imaging, 35% IoU on anatomical localization, and 4% macro accuracy on multi-timepoint chest X-ray analysis, while also raising text-based clinical knowledge by 5% on MedQA and 22% on EHRQA.
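For readers unfamiliar with the localization metric named above, the following sketch computes Intersection over Union for two axis-aligned bounding boxes. The `(x_min, y_min, x_max, y_max)` box convention and the function name `box_iou` are assumptions made for this example; the paper's own box format is not given on this page.

```python
from typing import Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes produce an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    reference = (1.0, 1.0, 3.0, 3.0)
    print(box_iou(predicted, reference))  # overlap area 1, union area 7
```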

What carries the argument

Long-context 3D volume slicing and whole-slide pathology sampling strategies that process high-dimensional inputs within a single unified architecture.
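A minimal sketch of the slicing step, under stated assumptions. The paper passage quoted later on this page, under "Lean theorems connected to this paper", says only that volumes were preprocessed into sequences of 2D axial images with at most 85 slices per query. The evenly spaced subsampling below is an assumption chosen for illustration, as is the function name `select_axial_slice_indices`; the paper may select slices differently.

```python
from typing import List

# Cap taken from the paper passage quoted on this page.
MAX_SLICES_PER_QUERY = 85


def select_axial_slice_indices(
    num_slices: int, max_slices: int = MAX_SLICES_PER_QUERY
) -> List[int]:
    """Pick at most `max_slices` axial slice indices from a volume.

    Assumption: slices are subsampled at even spacing, keeping the first
    and last slice. The source states the cap, not the selection rule.
    """
    if max_slices <= 0:
        raise ValueError("max_slices must be positive")
    if num_slices <= 0:
        return []
    if num_slices <= max_slices:
        return list(range(num_slices))  # small volume: keep every slice
    if max_slices == 1:
        return [num_slices // 2]        # single slice: take the middle one
    step = (num_slices - 1) / (max_slices - 1)
    return [round(i * step) for i in range(max_slices)]


if __name__ == "__main__":
    indices = select_axial_slice_indices(300)
    print(len(indices), indices[0], indices[-1])
```

Keeping the endpoints preserves the full anatomical extent of the scan, at the cost of coarser spacing through the middle of the volume.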

Load-bearing premise

The performance gains arise from the new data and specialized slicing methods rather than overfitting to the chosen evaluation sets.

What would settle it

Running MedGemma 1.5 4B on an independent collection of 3D MRI volumes collected after the model's training cutoff and observing no accuracy improvement over MedGemma 1 would falsify the generalization of the reported gains.
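One way such a comparison could be scored is a paired bootstrap over the same held-out cases, which yields an interval for the accuracy difference between the two models. This is a sketch of a standard resampling technique, not a procedure taken from the paper; the function names, the 2000 resamples, and the 95% interval are all assumptions.

```python
import random
from typing import Sequence, Tuple


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of cases a model answered correctly."""
    return sum(correct) / len(correct)


def paired_bootstrap_delta(
    correct_old: Sequence[bool],
    correct_new: Sequence[bool],
    n_resamples: int = 2000,  # assumed resample count
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed delta, 2.5th percentile, 97.5th percentile).

    Both sequences must describe the same cases in the same order, so that
    each resample compares the two models on identical volumes.
    """
    n = len(correct_old)
    if n == 0 or n != len(correct_new):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    observed = accuracy(correct_new) - accuracy(correct_old)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(correct_new[i] - correct_old[i] for i in idx) / n)
    deltas.sort()
    low = deltas[int(0.025 * n_resamples)]
    high = deltas[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, low, high


if __name__ == "__main__":
    # Toy per-case correctness flags, invented for illustration only.
    old = [True, False, False, True, False, True, False, False]
    new = [True, True, False, True, True, True, False, True]
    print(paired_bootstrap_delta(old, new))
```

An interval that includes zero on the independent set would be consistent with the "no improvement" outcome described above.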

Original abstract

We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Compared to MedGemma 1 4B, MedGemma 1.5 4B demonstrates significant gains in these new areas, improving 3D MRI condition classification accuracy by 11% and 3D CT condition classification by 3% (absolute improvements). In whole slide pathology imaging, MedGemma 1.5 4B achieves a 47% macro F1 gain. Additionally, it improves anatomical localization with a 35% increase in Intersection over Union on chest X-rays and achieves a 4% macro accuracy for longitudinal (multi-timepoint) chest x-ray analysis. Beyond its improved multimodal performance over MedGemma 1, MedGemma 1.5 improves on text-based clinical knowledge and reasoning, improving by 5% on MedQA accuracy and 22% on EHRQA accuracy. It also achieves an average of 18% macro F1 on 4 different lab report information extraction datasets (EHR Datasets 2, 3, 4, and Mendeley Clinical Laboratory Test Reports). Taken together, MedGemma 1.5 serves as a robust, open resource for the community, designed as an improved foundation on which developers can create the next generation of medical AI systems. Resources and tutorials for building upon MedGemma 1.5 can be found at https://goo.gle/medgemma.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MedGemma 1.5 is an incremental open-model update that adds 3D volume and whole-slide pathology handling with claimed gains over the prior version, but the evaluation details stay too thin to verify the improvements.


The core news here is a new 4B-parameter open model that extends the earlier MedGemma to handle 3D CT/MRI volumes, histopathology slides, bounding-box localization, and multi-timepoint chest X-rays, plus some text gains on clinical QA and lab-report extraction. The authors describe the practical changes they made (long-context slicing for volumes and patch sampling for slides) and report lifts over MedGemma 1 such as +11% accuracy on 3D MRI classification and +3% on 3D CT, both stated as absolute improvements, and +47% macro F1 on pathology slides. They also note smaller text improvements and position the release as a community foundation with tutorials attached. That openness and the concrete capability list are the useful parts; anyone building medical imaging tools now has a single checkpoint that covers more modalities than the last release.

The main limitation is that every number is given only as a delta against MedGemma 1, with no absolute scores, no description of the test splits, and no statement on whether the evaluation volumes or slides were held out from training. Without those pieces it is hard to tell how much of the reported lift comes from the new sampling methods versus simple overfitting or benchmark leakage. The same gap appears for the localization IoU and longitudinal CXR numbers. This is typical of technical reports, but it leaves the central claim, that the new data and slicing strategies produce genuine generalization, unverified on the page.

The paper is aimed at practitioners who want a ready-to-fine-tune medical multimodal starting point rather than at readers looking for a new algorithmic idea. I would bring it to a reading group to talk through the 3D and pathology handling choices, but I would not cite the numbers themselves until the absolute scores and contamination checks appear. It is worth sending to peer review so the authors can supply the missing evaluation protocol; the model itself is a reasonable contribution even if the current write-up needs tightening.
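The abstract labels its MRI and CT gains as absolute improvements, and the point about missing baselines can be shown with a little arithmetic. In the sketch below, the baselines of 40 and 80 are hypothetical values invented for illustration, not numbers from the paper; only the 11-point delta comes from the abstract.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Gain in percentage points: new score minus baseline score."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Gain as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical baselines; the same +11 points reads very differently.
    for baseline in (40.0, 80.0):
        new = baseline + 11.0
        print(
            f"baseline {baseline:.0f} -> {new:.0f}: "
            f"+{absolute_gain(baseline, new):.0f} points absolute, "
            f"{relative_gain(baseline, new):.1%} relative"
        )
```

The same 11-point delta is a 27.5% relative lift from a baseline of 40 but only 13.75% from a baseline of 80, which is why the delta alone does not convey the size of the improvement.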

Referee Report

3 major / 1 minor

Summary. The manuscript introduces MedGemma 1.5 4B as an extension of MedGemma 1, adding support for 3D CT/MRI volumes, histopathology whole-slide images, anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding. It describes new training data, long-context 3D slicing, and whole-slide sampling strategies, and claims absolute gains over the prior 4B model including +11% 3D MRI accuracy, +3% 3D CT accuracy, +47% macro F1 on pathology, +35% IoU on localization, +4% longitudinal CXR accuracy, +5% MedQA, +22% EHRQA, and 18% average macro F1 on lab-report extraction tasks.

Significance. If the gains prove robust, the work supplies an open multimodal medical foundation model with expanded 3D and pathology capabilities that could serve as a useful base for downstream medical AI systems. The explicit release of resources and tutorials is a positive contribution to reproducibility and community use.

major comments (3)

[Abstract] All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.
[Abstract] No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.
[Abstract] The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.
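The disjointness check requested in the second major comment can be as simple as a set intersection over identifiers. The sketch below assumes each volume, slide, or report carries a patient-level identifier; the function name `find_overlap` and the example IDs are hypothetical, and the paper's actual identifiers and splits are not described on this page.

```python
from typing import Iterable, Set


def find_overlap(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Return identifiers that appear in both the training and test splits."""
    return set(train_ids) & set(test_ids)


if __name__ == "__main__":
    # Hypothetical patient IDs, invented for illustration.
    train = ["patient-001", "patient-002", "patient-003"]
    test = ["patient-003", "patient-004"]
    leaked = find_overlap(train, test)
    if leaked:
        print("Leakage: shared identifiers:", sorted(leaked))
    else:
        print("Splits are disjoint.")
```

Checking at the patient level rather than the image level matters for 3D and longitudinal data, where several studies from one patient could otherwise land on both sides of the split.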

minor comments (1)

[Abstract] The phrase “an average of 18% macro F1 on 4 different lab report information extraction datasets” is ambiguous; it is unclear whether this is a macro-average across datasets or an average of per-dataset macro F1 scores, and no per-dataset numbers are supplied.
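The ambiguity is not cosmetic, because the two readings can give different numbers. The sketch below computes macro F1 two ways on toy data: as the mean of per-dataset macro F1 scores, and as a single macro F1 over all examples pooled together. The labels, dataset names, and predictions are invented for illustration and say nothing about the paper's lab-report datasets.

```python
from typing import Dict, List, Sequence, Tuple

Dataset = Tuple[List[str], List[str]]  # (true labels, predicted labels)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def mean_of_dataset_macro_f1(datasets: Dict[str, Dataset]) -> float:
    """Reading 1: compute macro F1 per dataset, then average the scores."""
    return sum(macro_f1(t, p) for t, p in datasets.values()) / len(datasets)


def pooled_macro_f1(datasets: Dict[str, Dataset]) -> float:
    """Reading 2: pool all examples, then compute one macro F1."""
    y_true = [t for true, _ in datasets.values() for t in true]
    y_pred = [p for _, pred in datasets.values() for p in pred]
    return macro_f1(y_true, y_pred)


if __name__ == "__main__":
    toy = {  # hypothetical datasets, for illustration only
        "A": (["pos", "pos", "neg", "neg"], ["pos", "pos", "neg", "neg"]),
        "B": (["pos", "neg", "neg", "neg"], ["neg", "neg", "neg", "neg"]),
    }
    print("mean of per-dataset macro F1:", mean_of_dataset_macro_f1(toy))
    print("pooled macro F1:", pooled_macro_f1(toy))
```

On this toy data the first reading gives 5/7 (about 0.714) and the second about 0.855, so per-dataset numbers would be needed to interpret the reported average.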

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving clarity and transparency in the manuscript. We agree that the abstract requires strengthening to better support the claims of improvement. We will make targeted revisions to the abstract and, where appropriate, the main text to address each point. Our point-by-point responses follow.


Referee: [Abstract] All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.

Authors: We agree that absolute metrics are necessary for readers to assess the practical significance of the reported gains. While the full manuscript contains tables reporting absolute accuracy, F1, and IoU values for both MedGemma 1 and MedGemma 1.5 on the shared test sets, the abstract relies solely on deltas. We will revise the abstract to include key absolute numbers (e.g., the baseline and new 3D MRI accuracy, pathology macro F1, and localization IoU) alongside the deltas, ensuring the improvements can be evaluated in context.
Revision: yes

Referee: [Abstract] No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.

Authors: We recognize the critical need for explicit details on evaluation protocols to demonstrate generalization. The full manuscript describes the datasets, sources, and train/test splits for the 3D imaging, pathology, and clinical QA tasks, and states that test data were held out. We will add concise statements to the abstract confirming the use of disjoint test sets, the evaluation datasets, and any statistical significance checks performed. A dedicated paragraph on data partitioning and leakage prevention will also be added or expanded in the methods section.
Revision: yes

Referee: [Abstract] The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.

Authors: We agree that the abstract should briefly convey the technical distinctions of our slicing and sampling approaches. The manuscript details the long-context 3D volume slicing strategy and whole-slide patch sampling in the methods, including how slices are aggregated at inference and how patch-level predictions are combined for slide-level outputs. We will revise the abstract to include a short description of these procedures and their differences from standard fixed-context or random-patch baselines. Expanded pseudocode and ablation results on aggregation choices will be added to the main text or supplementary material.
Revision: yes
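As an illustration of what "combining patch-level predictions for slide-level outputs" can look like, here is a sketch that averages per-patch class probabilities and takes the top class. Mean pooling is a common baseline substituted here for illustration; it is an assumption, not the aggregation the paper uses, which this page does not describe. The class names and probabilities are invented.

```python
from typing import Dict, Sequence


def aggregate_patch_probs(patch_probs: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Mean-pool per-patch class probabilities into slide-level probabilities."""
    if not patch_probs:
        raise ValueError("at least one patch prediction is required")
    classes = list(patch_probs[0].keys())
    return {
        c: sum(patch[c] for patch in patch_probs) / len(patch_probs)
        for c in classes
    }


def slide_prediction(patch_probs: Sequence[Dict[str, float]]) -> str:
    """Return the class with the highest mean probability across patches."""
    mean_probs = aggregate_patch_probs(patch_probs)
    return max(mean_probs, key=mean_probs.get)


if __name__ == "__main__":
    # Hypothetical per-patch outputs, for illustration only.
    patches = [
        {"tumor": 0.9, "normal": 0.1},
        {"tumor": 0.4, "normal": 0.6},
        {"tumor": 0.8, "normal": 0.2},
    ]
    print(aggregate_patch_probs(patches), slide_prediction(patches))
```

Alternatives such as majority voting or max pooling weight rare positive patches differently, which is the kind of choice the promised ablations would need to compare.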

Circularity Check

0 steps flagged

No significant circularity in claimed improvements


The paper is an empirical technical report describing new training data, 3D slicing strategies, and whole-slide sampling to extend MedGemma 1 into additional medical imaging modalities. Reported gains are presented as measured outcomes on evaluation tasks rather than quantities derived by construction from the model inputs or prior results. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to tautology appear in the text. External benchmarks and prior model comparisons supply independent content, keeping the derivation chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on the effectiveness of newly curated medical imaging and document datasets plus custom sampling strategies for 3D volumes and whole-slide images; these are not derived from first principles but chosen to enable the reported gains.

free parameters (2)

model size (4B parameters): Chosen architecture scale that the performance claims are tied to.
training data mixture and sampling rates for 3D and pathology: Ad-hoc choices required to integrate the new modalities.

axioms (1)

domain assumption: Standard supervised fine-tuning and evaluation on held-out medical benchmarks will reflect real-world utility. Invoked when translating benchmark deltas to claims of improved foundation for medical AI systems.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear

Paper passage: "3D CT and MR image volumes were preprocessed to sequences of individual 2D axial images... capped the number of axial slices per query to a maximum of 85"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Models Lack Temporal Awareness of Medical Knowledge
cs.LG · 2026-05 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
cs.IR · 2026-05 · unverdicted · novelty 5.0

EHR-RAGp is a retrieval-augmented EHR foundation model that employs prototype-guided retrieval to dynamically integrate relevant historical patient context, outperforming prior models on clinical prediction tasks.

CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
cs.CV · 2026-04 · unverdicted · novelty 5.0

CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
