MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil

Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

classification cs.AI cs.CL cs.CV

keywords medgemma, medical, models, performance, applications, capabilities, classification, collection

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Models Lack Temporal Awareness of Medical Knowledge
cs.LG 2026-05 unverdicted novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV 2026-05 accept novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
cs.CL 2026-03 unverdicted novelty 8.0

Dental-TriageBench is the first expert-annotated multimodal benchmark for hierarchical dental triage and shows a substantial performance gap between 19 MLLMs and junior dentists, especially on multi-domain referral cases.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
cs.CL 2026-05 unverdicted novelty 7.0

MedHopQA introduces a 1,000-question two-hop biomedical QA benchmark where retrieval-augmented systems reach 89% conceptual accuracy, outperforming zero-shot baselines by over 20 points.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 unverdicted novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 conditional novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0

EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
cs.CV 2026-05 unverdicted novelty 7.0

iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Recorruption arises from visual attention suppression and positional bias in multimodal RAG; BAIR mitigates it via bottleneck attention intervention at inference time.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
cs.CL 2026-05 unverdicted novelty 7.0

Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis
cs.CV 2026-05 unverdicted novelty 7.0

CT-Lite combines Feature Attention Style Transfer (FAST) and Structured Factorized Projections (SFP) with contrastive learning to reach AUROC within 5-7% of uncompressed baselines on compressed CT volumes across three...
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
cs.CV 2026-04 unverdicted novelty 7.0

X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
cs.CV 2026-04 unverdicted novelty 7.0

SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
cs.CV 2026-04 unverdicted novelty 7.0

A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
cs.LG 2026-04 unverdicted novelty 7.0

ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
cs.CV 2026-04 unverdicted novelty 7.0

Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing...
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
cs.CL 2026-04 unverdicted novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
cs.LG 2026-03 unverdicted novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift
cs.CV 2026-05 unverdicted novelty 6.0

JANUS conditions Vision Transformer embeddings on macro-radiomic priors via anatomically guided gating, reaching macro-AUROC 0.88 on an internal test set of 5082 cases and 0.87 on an external set of 2000 cases while i...
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
cs.LG 2026-05 unverdicted novelty 6.0

MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
cs.CV 2026-05 unverdicted novelty 6.0

Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
cs.CL 2026-05 unverdicted novelty 6.0

MedExAgent models clinical diagnosis as a POMDP with patient and exam noise, then uses supervised fine-tuning followed by DAPO optimization to train an agent that matches larger models on diagnostic accuracy while con...
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
cs.CV 2026-04 unverdicted novelty 6.0

MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
cs.SE 2026-04 unverdicted novelty 6.0

Open-weight LLMs reach 81-91% success generating formally verified Dafny code for complex algorithmic problems when given structural signatures and self-healing verifier feedback.
Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
cs.CR 2026-04 unverdicted novelty 6.0

Hybrid DP with LLM or NER preprocessing significantly improves the privacy-utility trade-off for Dutch clinical note de-identification compared to standalone DP.
Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
cs.CV 2026-04 unverdicted novelty 6.0

Fine-tuned MLLMs achieve competitive skeletal landmark localization on synthetic and real X-ray datasets compared to deep learning baselines and demonstrate reasoning for sequential C-arm navigation.
Hybrid Decision Making via Conformal VLM-generated Guidance
cs.AI 2026-04 unverdicted novelty 6.0

ConfGuide uses conformal risk control to generate targeted guidance sets in a learning-to-guide hybrid decision framework and demonstrates it on multi-label medical diagnosis.
Representation geometry shapes task performance in vision-language modeling for CT enterography
cs.CV 2026-04 unverdicted novelty 6.0

Mean pooling and multi-window RGB encoding optimize vision-language performance on CT enterography, with retrieval-augmented generation substantially improving automated report severity accuracy over fine-tuning alone.
MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs
cs.CV 2026-04 unverdicted novelty 6.0

MedConcept extracts reusable medical concepts from VLMs via sparse neuron activations, translates them to pseudo-reports, and scores them for semantic alignment using an independent medical LLM.
How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts
cs.CL 2026-04 unverdicted novelty 6.0

ClinicNumRobBench shows LLMs excel at value retrieval from clinical notes but struggle with relational comparisons and aggregations, with performance dropping under note-style variations and after medical fine-tuning.
Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

An LLM-based NLP tool was developed and tested to identify four types of HIV stigma in clinical notes, achieving up to 0.62 micro F1 score with GatorTron-large.
Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
cs.LG 2026-03 unverdicted novelty 6.0

RAVEN pretrains on over one million EHR sequences via recurrence-aware next-visit event prediction, enabling zero-shot disease incidence forecasting that rivals fine-tuned models and generalizes across cohorts.
Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
cs.CV 2026-03 unverdicted novelty 6.0

Instruction-free tuning of LVLMs on medical image-description pairs via momentum proxy instructions and response shuffling achieves SOTA accuracy on VQA tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
cs.AI 2026-05 unverdicted novelty 5.0

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
MultiMedVision: Multi-Modal Medical Vision Framework
cs.CV 2026-05 unverdicted novelty 5.0

A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification
cs.CV 2026-05 conditional novelty 5.0

CT-IDP derives over 900 quantitative phenotypes from multi-organ CT segmentations and uses sparse logistic regression to classify diseases, achieving macro-AUCs of 0.897/0.877/0.780 on MERLIN/Duke-Abdomen/AMOS dataset...
NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
cs.AI 2026-05 unverdicted novelty 5.0

NeuroAgent uses a hierarchical LLM agent framework with Generate-Execute-Validate loops to automate neuroimaging preprocessing, reaching 84.8% end-to-end correctness and 0.9518 AUC for Alzheimer's classification on 14...
Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction
cs.AI 2026-05 unverdicted novelty 5.0

LLMs match or beat supervised BERT models on detecting whether a discharge note contains an actionable clinical task but trail on classifying the exact type of action, pointing to the need for datasets that explain wh...
ReMedi: Reasoner for Medical Clinical Prediction
cs.CL 2026-05 unverdicted novelty 5.0

ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
cs.AI 2026-04 unverdicted novelty 5.0

LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not y...
Toward Multimodal Conversational AI for Age-Related Macular Degeneration
cs.CV 2026-04 unverdicted novelty 5.0

OcularChat, fine-tuned from Qwen2.5-VL on 705k simulated dialogues, reaches 0.954/0.849/0.678 accuracy on three AMD tasks in AREDS data, beats prior MLLMs, and scores higher than baselines on ophthalmologist grading o...
Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
cs.CL 2026-04 unverdicted novelty 5.0

A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
Retrieval-Guided Generation for Safer Histopathology Image Captioning
cs.CV 2026-04 unverdicted novelty 5.0

Retrieval-guided captioning from similar cases achieves higher semantic alignment (cosine similarity ~0.60 vs ~0.47) and fewer unsupported diagnoses than MedGemma on the ARCH dataset.
Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
cs.CL 2026-04 conditional novelty 5.0

Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
cs.LG 2026-04 unverdicted novelty 5.0

SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
cs.CV 2026-04 unverdicted novelty 5.0

CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
cs.CL 2026-04 unverdicted novelty 5.0

CARIS is a new agentic LLM framework that automates clinical research workflows from planning to reporting in a coding-free and privacy-preserving manner, achieving high completeness scores on heterogeneous datasets.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
cs.CV 2026-03 unverdicted novelty 5.0

CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.