{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:SUKURWIFFRESUY3MBWZPFT3YYO","short_pith_number":"pith:SUKURWIF","schema_version":"1.0","canonical_sha256":"951548d9052c492a636c0db2f2cf78c3aadbf4b49273a797febc9f7060342288","source":{"kind":"arxiv","id":"2305.09617","version":1},"attestation_state":"computed","paper":{"title":"Towards Expert-Level Medical Question Answering with Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alan Karthikesalingam, Amy Wang, Blaise Aguera y Arcas, Bradley Green, Christopher Semturs, Dale Webster, Darlene Neal, Ellery Wulczyn, Ewa Dominowska, Greg S. Corrado, Heather Cole-Lewis, Joelle Barral, Juraj Gottweis, Karan Singhal, Kevin Clark, Le Hou, Mike Schaekermann, Mohamed Amin, Nenad Tomasev, Philip Mansfield, Renee Wong, Rory Sayres, Sami Lachgar, Shekoofeh Azizi, S. Sara Mahdavi, Stephen Pfohl, Sushant Prakash, Tao Tu, Vivek Natarajan, Yossi Matias, Yun Liu","submitted_at":"2023-05-16T17:11:29Z","abstract_excerpt":"Recent artificial intelligence (AI) systems have reached milestones in \"grand challenges\" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge.\n  Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a \"passing\" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested sign"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"2305.09617","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-05-16T17:11:29Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"d9273743d6ca60edb5588541041fdcd1e684d53a700b513102c6d5e380007bcb","abstract_canon_sha256":"85392a0c5951611f4650725d01d6772fd9912b3132ec1058ef4b5c370d7b972a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-24T04:28:47.141567Z","signature_b64":"UH2nDKKRMCH9By6rXef28pig4CLkjNoVViDPFyNbELvxxVs40QETl0Q1B86r2nERYqEUFgX9NanJqgRNm38ZBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"951548d9052c492a636c0db2f2cf78c3aadbf4b49273a797febc9f7060342288","last_reissued_at":"2026-05-24T04:28:47.139132Z","signature_status":"signed_v1","first_computed_at":"2026-05-24T04:28:47.139132Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Towards Expert-Level Medical Question Answering with Large Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alan Karthikesalingam, Amy Wang, Blaise Aguera y Arcas, Bradley Green, Christopher Semturs, Dale Webster, Darlene Neal, Ellery Wulczyn, Ewa Dominowska, Greg S. Corrado, Heather Cole-Lewis, Joelle Barral, Juraj Gottweis, Karan Singhal, Kevin Clark, Le Hou, Mike Schaekermann, Mohamed Amin, Nenad Tomasev, Philip Mansfield, Renee Wong, Rory Sayres, Sami Lachgar, Shekoofeh Azizi, S. Sara Mahdavi, Stephen Pfohl, Sushant Prakash, Tao Tu, Vivek Natarajan, Yossi Matias, Yun Liu","submitted_at":"2023-05-16T17:11:29Z","abstract_excerpt":"Recent artificial intelligence (AI) systems have reached milestones in \"grand challenges\" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge.\n  Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a \"passing\" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested sign"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2305.09617","kind":"arxiv","version":1},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2305.09617/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.09617","created_at":"2026-05-24T04:28:47.139279+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.09617v1","created_at":"2026-05-24T04:28:47.139279+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.09617","created_at":"2026-05-24T04:28:47.139279+00:00"},{"alias_kind":"pith_short_12","alias_value":"SUKURWIFFRES","created_at":"2026-05-24T04:28:47.139279+00:00"},{"alias_kind":"pith_short_16","alias_value":"SUKURWIFFRESUY3M","created_at":"2026-05-24T04:28:47.139279+00:00"},{"alias_kind":"pith_short_8","alias_value":"SUKURWIF","created_at":"2026-05-24T04:28:47.139279+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2401.02458","citing_title":"Data-Centric Foundation Models in Computational Healthcare: A Survey","ref_index":275,"is_internal_anchor":true},{"citing_arxiv_id":"2407.13193","citing_title":"Retrieval-Augmented Generation for Natural Language Processing: A Survey","ref_index":159,"is_internal_anchor":true},{"citing_arxiv_id":"2409.00084","citing_title":"Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2410.18856","citing_title":"Entry-level guide to the use of large language models for medical research","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14751","citing_title":"Query pipeline optimization for cancer patient question answering systems","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2502.16022","citing_title":"Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2406.04244","citing_title":"Benchmark Data Contamination of Large Language Models: A Survey","ref_index":137,"is_internal_anchor":true},{"citing_arxiv_id":"2504.12334","citing_title":"QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23304","citing_title":"MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2404.18416","citing_title":"Capabilities of Gemini Models in Medicine","ref_index":262,"is_internal_anchor":true},{"citing_arxiv_id":"2412.18925","citing_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02501","citing_title":"ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2310.05737","citing_title":"Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation","ref_index":255,"is_internal_anchor":true},{"citing_arxiv_id":"2311.05232","citing_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","ref_index":292,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03476","citing_title":"CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06584","citing_title":"NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05407","citing_title":"PRISM: Perception Reasoning Interleaved for Sequential Decision Making","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2502.18864","citing_title":"Towards an AI co-scientist","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15676","citing_title":"EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17114","citing_title":"The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21027","citing_title":"HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering","ref_index":245,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01048","citing_title":"Compared to What? Baselines and Metrics for Counterfactual Prompting","ref_index":133,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06584","citing_title":"NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research","ref_index":14,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO","json":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO.json","graph_json":"https://pith.science/api/pith-number/SUKURWIFFRESUY3MBWZPFT3YYO/graph.json","events_json":"https://pith.science/api/pith-number/SUKURWIFFRESUY3MBWZPFT3YYO/events.json","paper":"https://pith.science/paper/SUKURWIF"},"agent_actions":{"view_html":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO","download_json":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO.json","view_paper":"https://pith.science/paper/SUKURWIF","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.09617&json=true","fetch_graph":"https://pith.science/api/pith-number/SUKURWIFFRESUY3MBWZPFT3YYO/graph.json","fetch_events":"https://pith.science/api/pith-number/SUKURWIFFRESUY3MBWZPFT3YYO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO/action/storage_attestation","attest_author":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO/action/author_attestation","sign_citation":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO/action/citation_signature","submit_replication":"https://pith.science/pith/SUKURWIFFRESUY3MBWZPFT3YYO/action/replication_record"}},"created_at":"2026-05-24T04:28:47.139279+00:00","updated_at":"2026-05-24T04:28:47.139279+00:00"}