Med-Flamingo: a Multimodal Medical Few-shot Learner

Cyril Zakka; Eduardo Pontes Reis; Jure Leskovec; Michael Moor; Michihiro Yasunaga; Pranav Rajpurkar; Qian Huang; Shirley Wu; Yash Dalmia

arxiv: 2307.15189 · v1 · pith:ZBX3NHJ3new · submitted 2023-07-27 · 💻 cs.CV · cs.AI

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec

keywords: medical, med-flamingo, few-shot, generative, models, multimodal, applications, data

Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable down-stream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real-time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets including a novel challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinician's rating and firstly enables multimodal medical few-shot adaptations, such as rationale generation. We release our model, code, and evaluation app under https://github.com/snap-stanford/med-flamingo.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation
cs.CL 2026-05 unverdicted novelty 6.0

DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph ...

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Wasserstein equilibrium decoding that improves accuracy and convergence speed for small VLMs on medical VQA benchmarks by using semantic consensus instead of lexical order.

Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA
cs.CV 2026-05 unverdicted novelty 5.0

Ask4VG learns a risk estimator from counterfactual visual probes to rerank question rewrites, reducing held-out hallucination risk from 0.658 to 0.623 and raising accuracy from 0.337 to 0.356 on VQA-RAD.

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

BiomedAP improves robustness of biomedical VLMs to prompt variations using gated cross-modal fusion and dual-anchor constraints, outperforming baselines on 11 benchmarks.

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning
cs.LG 2026-05 unverdicted novelty 5.0

FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG 2024-01 unverdicted novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.