{"paper":{"title":"Capabilities of Gemini Models in Medicine","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.","cross_cats":["cs.CL","cs.CV","cs.LG"],"primary_cat":"cs.AI","authors_text":"Aishwarya Kamath, Alan Karthikesalingam, Albert Webson, Anil Palepu, Basil Mustafa, Ben Caine, Bradley Green, Cathy Cheung, Charles Lau, Christopher Semturs, Chunjong Park, Claire Cui, Dale Webster, Daniel McDuff, David G.T. Barrett, David Stutz, Demis Hassabis, Ehud Rivlin, Elahe Vedadi, Ellery Wulczyn, Ewa Dominowska, Fan Zhang, Greg Corrado, James Manyika, Jan Freyberg, Jean-Baptiste Alayrac, Jeff Dean, Jeremy Lai, Jesper Anderson, Jian Lu, Joelle Barral, Jonas Kemp, Jonathan Krause, Jonathon Shlens, Juanma Zambrano Chaves, Juraj Gottweis, Katherine Chou, Kavita Kulkarni, Khaled Saab, Kimberly Kanada, Koray Kavukcuoglu, Le Hou, Luheng He, Luyang Liu, Melvin Johnson, Mike Schaekermann, Natasha Latysheva, Neil Houlsby, Nenad Tomasev, Oriol Vinyals, Philip Mansfield, Renee Wong, Ruoxi Sun, Ryutaro Tanno, Shekoofeh Azizi, Siamak Shakeri, SiWai Man, S. M. Ali Eslami, S. Sara Mahdavi, Szu-Yeu Hu, Tao Tu, Tim Strother, Tomer Golany, Vivek Natarajan, Wei-Hung Weng, Yong Cheng, Yossi Matias","submitted_at":"2024-04-29T04:11:28Z","abstract_excerpt":"Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. 
Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custo"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy on MedQA (USMLE) using a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark where direct comparison is viable, and improves over GPT-4V by an average relative margin of 44.5% on 7 multimodal benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That benchmark accuracy on curated medical datasets (MedQA, NEJM Image Challenges, MMMU health subset, etc.) will translate to reliable performance and safety in real clinical workflows with noisy, incomplete, or out-of-distribution patient data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical 
benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8c8b8dbe547f5a5930e8826c7db565b4ed2ec11b1e39f1cebae2da94e314200d"},"source":{"id":"2404.18416","kind":"arxiv","version":2},"verdict":{"id":"80055392-aa09-4f32-ba7b-94df4299bdb8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T17:08:19.092721Z","strongest_claim":"Our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy on MedQA (USMLE) using a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark where direct comparison is viable, and improves over GPT-4V by an average relative margin of 44.5% on 7 multimodal benchmarks.","one_line_summary":"Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That benchmark accuracy on curated medical datasets (MedQA, NEJM Image Challenges, MMMU health subset, etc.) will translate to reliable performance and safety in real clinical workflows with noisy, incomplete, or out-of-distribution patient data.","pith_extraction_headline":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks."},"references":{"count":269,"sample":[{"doi":"","year":2023,"title":"M. D. Abràmoff, M. E. Tarver, N. Loyo-Berrios, S. Trujillo, D. Char, Z. Obermeyer, M. B. Eydelman, F. P. of Ophthalmic Imaging, D. 
Algorithmic Interpretation Working Group of the Collaborative Com","work_id":"50d30931-c231-4d9d-bd86-c80677034215","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2022,"title":"J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural inform","work_id":"689aee13-435b-45b5-9380-e288b5f0b82b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"PaLM 2 Technical Report","work_id":"905ee9a7-ea61-4a94-bd62-2600cbe3e315","ref_index":4,"cited_arxiv_id":"2305.10403","is_internal_anchor":true},{"doi":"","year":2023,"title":"F. Antaki, D. Milad, M. A. Chia, C.-É. Giguère, S. Touma, J. El-Khoury, P. A. Keane, and R. Duval. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards hum","work_id":"60f0fc36-6d81-4394-995a-bc9de437fdaf","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":269,"snapshot_sha256":"875c2bf9395867b1156eb558834c38db69f0db2904af6a525180658dc1c19ab4","internal_anchors":16},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}