{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:ML5KRKZ2U3SKXDRGO4CBBASGI4","short_pith_number":"pith:ML5KRKZ2","schema_version":"1.0","canonical_sha256":"62faa8ab3aa6e4ab8e2677041082464732608b331e66262eae1e44c5fcb3f97a","source":{"kind":"arxiv","id":"2404.18416","version":2},"attestation_state":"computed","paper":{"title":"Capabilities of Gemini Models in Medicine","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.","cross_cats":["cs.CL","cs.CV","cs.LG"],"primary_cat":"cs.AI","authors_text":"Aishwarya Kamath, Alan Karthikesalingam, Albert Webson, Anil Palepu, Basil Mustafa, Ben Caine, Bradley Green, Cathy Cheung, Charles Lau, Christopher Semturs, Chunjong Park, Claire Cui, Dale Webster, Daniel McDuff, David G.T. Barrett, David Stutz, Demis Hassabis, Ehud Rivlin, Elahe Vedadi, Ellery Wulczyn, Ewa Dominowska, Fan Zhang, Greg Corrado, James Manyika, Jan Freyberg, Jean-Baptiste Alayrac, Jeff Dean, Jeremy Lai, Jesper Anderson, Jian Lu, Joelle Barral, Jonas Kemp, Jonathan Krause, Jonathon Shlens, Juanma Zambrano Chaves, Juraj Gottweis, Katherine Chou, Kavita Kulkarni, Khaled Saab, Kimberly Kanada, Koray Kavukcuoglu, Le Hou, Luheng He, Luyang Liu, Melvin Johnson, Mike Schaekermann, Natasha Latysheva, Neil Houlsby, Nenad Tomasev, Oriol Vinyals, Philip Mansfield, Renee Wong, Ruoxi Sun, Ryutaro Tanno, Shekoofeh Azizi, Siamak Shakeri, SiWai Man, S. M. Ali Eslami, S. Sara Mahdavi, Szu-Yeu Hu, Tao Tu, Tim Strother, Tomer Golany, Vivek Natarajan, Wei-Hung Weng, Yong Cheng, Yossi Matias","submitted_at":"2024-04-29T04:11:28Z","abstract_excerpt":"Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custo"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2404.18416","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.AI","submitted_at":"2024-04-29T04:11:28Z","cross_cats_sorted":["cs.CL","cs.CV","cs.LG"],"title_canon_sha256":"cb50dbe911bb83a2d23526805ba180095681824fbf212e77b5fbe93c1962eff6","abstract_canon_sha256":"caee117c34f877923c1935488b44d1956325b13f0bab70c31f5478b87fe76cc7"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.767252Z","signature_b64":"v+cDs46bCiR9exVDhqX2zgBcm9dNKqOrMA7Jl3UaZhLBU5qKSeK19NzU25nqhhMVJsMJVYemVbkUEEOgksxmCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"62faa8ab3aa6e4ab8e2677041082464732608b331e66262eae1e44c5fcb3f97a","last_reissued_at":"2026-05-17T23:38:50.766478Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.766478Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Capabilities of Gemini Models in Medicine","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.","cross_cats":["cs.CL","cs.CV","cs.LG"],"primary_cat":"cs.AI","authors_text":"Aishwarya Kamath, Alan Karthikesalingam, Albert Webson, Anil Palepu, Basil Mustafa, Ben Caine, Bradley Green, Cathy Cheung, Charles Lau, Christopher Semturs, Chunjong Park, Claire Cui, Dale Webster, Daniel McDuff, David G.T. Barrett, David Stutz, Demis Hassabis, Ehud Rivlin, Elahe Vedadi, Ellery Wulczyn, Ewa Dominowska, Fan Zhang, Greg Corrado, James Manyika, Jan Freyberg, Jean-Baptiste Alayrac, Jeff Dean, Jeremy Lai, Jesper Anderson, Jian Lu, Joelle Barral, Jonas Kemp, Jonathan Krause, Jonathon Shlens, Juanma Zambrano Chaves, Juraj Gottweis, Katherine Chou, Kavita Kulkarni, Khaled Saab, Kimberly Kanada, Koray Kavukcuoglu, Le Hou, Luheng He, Luyang Liu, Melvin Johnson, Mike Schaekermann, Natasha Latysheva, Neil Houlsby, Nenad Tomasev, Oriol Vinyals, Philip Mansfield, Renee Wong, Ruoxi Sun, Ryutaro Tanno, Shekoofeh Azizi, Siamak Shakeri, SiWai Man, S. M. Ali Eslami, S. Sara Mahdavi, Szu-Yeu Hu, Tao Tu, Tim Strother, Tomer Golany, Vivek Natarajan, Wei-Hung Weng, Yong Cheng, Yossi Matias","submitted_at":"2024-04-29T04:11:28Z","abstract_excerpt":"Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custo"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy on MedQA (USMLE) using a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark where direct comparison is viable, and improves over GPT-4V by an average relative margin of 44.5% on 7 multimodal benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That benchmark accuracy on curated medical datasets (MedQA, NEJM Image Challenges, MMMU health subset, etc.) will translate to reliable performance and safety in real clinical workflows with noisy, incomplete, or out-of-distribution patient data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8c8b8dbe547f5a5930e8826c7db565b4ed2ec11b1e39f1cebae2da94e314200d"},"source":{"id":"2404.18416","kind":"arxiv","version":2},"verdict":{"id":"80055392-aa09-4f32-ba7b-94df4299bdb8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T17:08:19.092721Z","strongest_claim":"Our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy on MedQA (USMLE) using a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark where direct comparison is viable, and improves over GPT-4V by an average relative margin of 44.5% on 7 multimodal benchmarks.","one_line_summary":"Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That benchmark accuracy on curated medical datasets (MedQA, NEJM Image Challenges, MMMU health subset, etc.) will translate to reliable performance and safety in real clinical workflows with noisy, incomplete, or out-of-distribution patient data.","pith_extraction_headline":"Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks."},"references":{"count":269,"sample":[{"doi":"","year":2023,"title":"M. D. Abr \\`a moff, M. E. Tarver, N. Loyo-Berrios, S. Trujillo, D. Char, Z. Obermeyer, M. B. Eydelman, F. P. of Ophthalmic Imaging, D. Algorithmic Interpretation Working Group of the Collaborative Com","work_id":"50d30931-c231-4d9d-bd86-c80677034215","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2022,"title":"J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural inform","work_id":"689aee13-435b-45b5-9380-e288b5f0b82b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"PaLM 2 Technical Report","work_id":"905ee9a7-ea61-4a94-bd62-2600cbe3e315","ref_index":4,"cited_arxiv_id":"2305.10403","is_internal_anchor":true},{"doi":"","year":2023,"title":"F. Antaki, D. Milad, M. A. Chia, C.- \\'E . Gigu \\`e re, S. Touma, J. El-Khoury, P. A. Keane, and R. Duval. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards hum","work_id":"60f0fc36-6d81-4394-995a-bc9de437fdaf","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":269,"snapshot_sha256":"875c2bf9395867b1156eb558834c38db69f0db2904af6a525180658dc1c19ab4","internal_anchors":16},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.18416","created_at":"2026-05-17T23:38:50.766608+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.18416v2","created_at":"2026-05-17T23:38:50.766608+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.18416","created_at":"2026-05-17T23:38:50.766608+00:00"},{"alias_kind":"pith_short_12","alias_value":"ML5KRKZ2U3SK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ML5KRKZ2U3SKXDRG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ML5KRKZ2","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2605.23629","citing_title":"DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2401.02458","citing_title":"Data-Centric Foundation Models in Computational Healthcare: A Survey","ref_index":253,"is_internal_anchor":true},{"citing_arxiv_id":"2412.04468","citing_title":"NVILA: Efficient Frontier Visual Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2503.18297","citing_title":"Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2504.12334","citing_title":"QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2505.03519","citing_title":"Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09450","citing_title":"ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20022","citing_title":"MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16215","citing_title":"Fully Open Meditron: An Auditable Pipeline for Clinical LLMs","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2601.20375","citing_title":"LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2602.12705","citing_title":"MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2412.18925","citing_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08559","citing_title":"Medical Reasoning with Large Language Models: A Survey and MR-Bench","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2603.29693","citing_title":"Measuring the metacognition of AI","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07361","citing_title":"BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11143","citing_title":"ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04012","citing_title":"SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10025","citing_title":"Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10761","citing_title":"RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology","ref_index":94,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04012","citing_title":"SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01077","citing_title":"Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20560","citing_title":"LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2502.18864","citing_title":"Towards an AI co-scientist","ref_index":265,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20022","citing_title":"MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19937","citing_title":"Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning","ref_index":50,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4","json":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4.json","graph_json":"https://pith.science/api/pith-number/ML5KRKZ2U3SKXDRGO4CBBASGI4/graph.json","events_json":"https://pith.science/api/pith-number/ML5KRKZ2U3SKXDRGO4CBBASGI4/events.json","paper":"https://pith.science/paper/ML5KRKZ2"},"agent_actions":{"view_html":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4","download_json":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4.json","view_paper":"https://pith.science/paper/ML5KRKZ2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.18416&json=true","fetch_graph":"https://pith.science/api/pith-number/ML5KRKZ2U3SKXDRGO4CBBASGI4/graph.json","fetch_events":"https://pith.science/api/pith-number/ML5KRKZ2U3SKXDRGO4CBBASGI4/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4/action/storage_attestation","attest_author":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4/action/author_attestation","sign_citation":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4/action/citation_signature","submit_replication":"https://pith.science/pith/ML5KRKZ2U3SKXDRGO4CBBASGI4/action/replication_record"}},"created_at":"2026-05-17T23:38:50.766608+00:00","updated_at":"2026-05-17T23:38:50.766608+00:00"}