{"paper":{"title":"Capabilities of GPT-4 on Medical Challenge Problems","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Dean Carignan, Eric Horvitz, Harsha Nori, Nicholas King, Scott Mayer McKinney","submitted_at":"2023-03-20T16:18:38Z","abstract_excerpt":"Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the U"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM).","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The official USMLE practice materials used are representative of the actual exam content and difficulty, and the model has not memorized the specific questions during pre-training (probed but not fully detailed in available text).","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned 
Med-PaLM on the MultiMedQA benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"aa5c4eb7d7207b2a0f322277a88920b29e5aa820fc6a70938f87f5349b2f9c45"},"source":{"id":"2303.13375","kind":"arxiv","version":2},"verdict":{"id":"9cb7a1f6-64b4-4b96-a8d5-8b5435604c7b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:39:07.226053Z","strongest_claim":"GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM).","one_line_summary":"GPT-4 exceeds the USMLE passing score by more than 20 points and outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM on the MultiMedQA benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The official USMLE practice materials used are representative of the actual exam content and difficulty, and the model has not memorized the specific questions during pre-training (probed but not fully detailed in available text).","pith_extraction_headline":"GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts."},"references":{"count":23,"sample":[{"doi":"","year":2019,"title":"Guidelines for human-AI interaction","work_id":"1915e653-ede4-4aa0-8b7b-ff8f5553f406","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"75db645b-ae81-481a-bdbc-12b3f493d004","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"BERT: Pre-training of Deep 
Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","ref_index":3,"cited_arxiv_id":"1810.04805","is_internal_anchor":true},{"doi":"","year":1951,"title":"Automated identification of adults at risk for in-hospital clinical deterioration","work_id":"c8446f66-fd7d-4d21-bee4-df87f15d6612","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Who goes first? Influences of human-ai workflow on decision making in clinical imaging","work_id":"22b87454-3f97-4f73-8349-554d6fdc39c6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":23,"snapshot_sha256":"a642b93a7c1f94352111c4390b17bf3f3284ffcf42f266171218c391874f291a","internal_anchors":8},"formal_canon":{"evidence_count":1,"snapshot_sha256":"25353279e17364f83947f1997e60edd69bc0e9ff5a17263edda515bbdb3b388d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}