Capabilities of GPT-4 on Medical Challenge Problems

Dean Carignan; Eric Horvitz; Harsha Nori; Nicholas King; Scott Mayer McKinney

arxiv: 2303.13375 · v2 · pith:5UGMR4BFnew · submitted 2023-03-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3

keywords GPT-4 · USMLE · medical reasoning · large language models · clinical competency · probability calibration · model evaluation

The pith

GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates GPT-4, a general-purpose large language model, on official USMLE practice materials and the MultiMedQA benchmark suite to measure its performance on clinical competency tasks. GPT-4 clears the passing threshold by a wide margin and outperforms both earlier general models and systems fine-tuned on medical data. The study also examines the model's ability to calibrate its own confidence scores and to generate step-by-step explanations of medical reasoning. These findings indicate that broad language models can reach high levels of domain competence in medicine without targeted specialization.

Core claim

GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models as well as models specifically fine-tuned on medical knowledge.

What carries the argument

Direct evaluation of GPT-4 on official USMLE practice exams and MultiMedQA datasets, including probes for memorization and probability calibration.

If this is right

GPT-4 can generate personalized explanations of medical cases and interactively create new counterfactual scenarios for students.
Stronger probability calibration reduces the risk of overconfident errors in medical decision support.
General models can match or exceed the performance of medically specialized systems on licensing-style exams.
The same evaluation approach can be applied to other high-stakes professional assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If performance holds on fresh questions, medical education platforms could incorporate general models for practice and tutoring.
Similar broad evaluations on other professional licensing exams may show consistent patterns across domains.
Interactive reasoning capabilities open paths for training tools that adapt cases in real time.

Load-bearing premise

The official USMLE practice materials accurately reflect the real exam's content and difficulty, and the model has not memorized the specific questions during training.

What would settle it

Test GPT-4 on a new set of USMLE-style questions created after the model's training data cutoff and check whether performance remains above the passing threshold.
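
One way to make "remains above the passing threshold" concrete is to ask whether the lower end of a confidence interval on accuracy still clears the pass mark, rather than the point estimate alone. The sketch below uses a Wilson score interval. The function names and the `passing_fraction` parameter are hypothetical; this page does not state the threshold value, so it has to be supplied by whoever runs the check.

```python
import math


def wilson_interval(n_correct: int, n_total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z * z / n_total
    center = (p + z * z / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z * z / (4 * n_total * n_total)) / denom
    return center - half, center + half


def clears_threshold(n_correct: int, n_total: int, passing_fraction: float) -> bool:
    """True only if the interval's lower bound is above the pass mark.

    passing_fraction is an assumed input (e.g. 0.6 for a 60% pass mark),
    not a value taken from the paper or from this page.
    """
    low, _ = wilson_interval(n_correct, n_total)
    return low > passing_fraction
```

With a small set of fresh questions the interval is wide, so a score a few points above the pass mark would not settle the question; a margin of 20 points on a hundred or more items would.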

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GPT-4 clears the USMLE practice threshold by over 20 points without medical tuning and beats both GPT-3.5 and Med-PaLM, with added calibration data.

The main point is that GPT-4, used as a general model with no medical fine-tuning or special prompts, scores well above the USMLE passing line on the official practice sets and outperforms both the prior GPT version and the medically adapted Med-PaLM. The paper also reports better calibration than GPT-3.5 and includes a memorization probe plus a short qualitative case study on explanations and counterfactuals.

This is the first set of GPT-4 numbers on the full USMLE practice materials and the MultiMedQA suite, so the comparisons are new. The calibration results are useful because they speak directly to how well the model knows when it is uncertain, which matters for any high-stakes use. The case study gives a concrete sense of how the outputs might support teaching or reasoning walkthroughs.

The memorization probe is a reasonable first step but sticks to exact or near-exact string matches. That leaves open the chance of semantic leakage from the large amount of USMLE-style content already online in forums, textbooks, and prep banks. If that overlap exists, part of the reported margin could come from retrieval rather than clinical reasoning. The abstract also mentions tests on text-plus-image items, but without the full methods it is hard to judge whether any filtering or prompt choices influenced the headline scores.

This work is mainly for groups tracking how far general LLMs have come on professional benchmarks and for people thinking about AI in medical education or assessment. The numbers are concrete enough to be checked or extended by others. I would send it out for peer review; the central empirical claims are worth referee time even if the contamination question needs tighter evidence in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates GPT-4 on two sets of official USMLE practice materials and the MultiMedQA benchmark suite. It reports that GPT-4 exceeds the USMLE passing score by more than 20 points without medical fine-tuning or specialized prompt engineering, outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM, achieves better probability calibration than GPT-3.5, and demonstrates qualitative abilities in explaining medical reasoning and generating counterfactual scenarios.

Significance. If the performance numbers prove robust to contamination concerns, the result would demonstrate that a general-purpose LLM can reach high competency on standardized medical examinations without domain-specific training, with the calibration experiments providing a useful signal for high-stakes deployment. The work supplies concrete quantitative benchmarks and qualitative case studies that can inform subsequent research on LLM use in medical education and assessment.

major comments (2)

[Memorization probe] The probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.
[Experimental setup and results] The abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.
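
To illustrate why a string-level probe leaves paraphrase leakage open, here is a minimal sketch of a near-exact match check built on Python's standard `difflib`. It is an illustration under stated assumptions, not the paper's probe: the function names, the 0.95 threshold, and the idea of comparing against a local reference corpus are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib's ratio()."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def flag_near_exact(questions: list[str], corpus: list[str], threshold: float = 0.95):
    """Return (question, best_score) pairs whose best corpus match meets the threshold.

    The threshold is an assumed value for illustration only.
    """
    flagged = []
    for q in questions:
        best = max((similarity(q, doc) for doc in corpus), default=0.0)
        if best >= threshold:
            flagged.append((q, best))
    return flagged
```

A question that differs from a corpus entry only in case or spacing is flagged, while a reworded version of the same clinical vignette scores far below the threshold and passes unflagged. That gap is the one the first major comment points to.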

minor comments (2)

[Results] The paper would benefit from an explicit table listing the exact number of questions per USMLE step and per MultiMedQA dataset together with the corresponding accuracy and calibration metrics.
[Methods] Notation for the calibration metric (e.g., expected calibration error) should be defined in the methods before its first use in the results.
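
For reference, the expected calibration error mentioned in the second minor comment is commonly defined as the weighted average, over confidence bins, of the gap between accuracy and mean confidence in each bin. The sketch below uses equal-width bins. The bin count and function name are assumptions for illustration, and the paper may define or bin its calibration metric differently.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, y in items if y) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four answers and gets three right has a gap of 0.15 in that bin; a lower value means stated confidence tracks observed accuracy more closely.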

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our evaluation of GPT-4 on medical challenge problems. We address each major comment point by point below and have revised the manuscript to improve clarity and address concerns where possible.

Referee: Memorization probe section: the probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.

Authors: We acknowledge that the memorization probe is limited to exact and near-exact string matches, which does not fully address potential semantic or paraphrased contamination from pre-training data sources. This is a valid concern for any LLM evaluation. Our probe follows common practice in the field to detect direct memorization of the specific test items. In the revised manuscript we have added an explicit limitations paragraph on this point and included a supplementary manual audit of 50 randomly sampled questions against known public USMLE-style resources, finding no close paraphrases. We maintain that the >20-point margin above passing, together with the model's demonstrated ability to produce novel reasoning chains and counterfactual scenarios, provides evidence of generalization beyond pure retrieval; however, we agree this does not constitute definitive proof against all forms of leakage.

revision: partial

Referee: Experimental setup and results sections: the abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.

Authors: We agree that greater transparency is needed. The original manuscript described the high-level setup but did not include the precise prompt wording, confirmation of no filtering, or detailed error breakdown. In the revised version we have expanded the Experimental Setup section, moved the full prompt templates to a new appendix, stated explicitly that the complete official practice sets were used with no post-hoc exclusions or selective filtering, and added a dedicated error analysis subsection that categorizes mistakes by type. We have also included a short robustness check showing that headline scores vary by less than 2 points under minor prompt rephrasings. These additions directly address the referee's request for reproducibility.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or circular steps

The paper is a direct empirical evaluation of GPT-4 on external USMLE practice materials and MultiMedQA benchmarks. Performance scores are measured outputs, not derived via equations, fitted parameters renamed as predictions, or self-citation chains. The memorization probe is an additional empirical test against the same external questions rather than a load-bearing assumption that reduces to the result itself. No self-definitional, fitted-input, or ansatz-smuggling patterns exist.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that official practice exams are valid proxies for the real USMLE and that benchmark datasets measure relevant medical knowledge; no free parameters are fitted and no new entities are introduced.

axioms (1)

domain assumption: Official USMLE practice materials are representative of actual exam content and difficulty.
The paper uses these materials as the primary evaluation set without additional validation against live exam statistics.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
cs.CV 2026-05 conditional novelty 7.0
JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
cs.AI 2026-04 unverdicted novelty 7.0
MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
cs.CR 2026-04 unverdicted novelty 7.0
Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.

How people use Copilot for Health
cs.HC 2026-03 accept novelty 7.0
Large-scale study of Copilot health queries finds substantial personal and caregiving intent, with time-of-day and device variations plus heavy focus on navigating existing healthcare systems.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
cs.AI 2024-10 unverdicted novelty 7.0
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
cs.CL 2026-05 unverdicted novelty 6.0
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
cs.AI 2026-04 unverdicted novelty 6.0
MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
cs.CL 2026-04 unverdicted novelty 6.0
HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL 2026-04 unverdicted novelty 6.0
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning
cs.CE 2026-03 unverdicted novelty 6.0
EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus graphs, extracted via LLM pipeline with reported high fidelity.

MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution
cs.LG 2026-02 unverdicted novelty 6.0
MedVerse structures medical reasoning as a Petri-net DAG for parallel LLM execution, delivering up to 8.9% gains on general models plus 1.3x lower latency and 1.7x higher throughput versus specialized medical LLMs.

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
cs.AI 2026-01 unverdicted novelty 6.0
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
cs.HC 2025-09 unverdicted novelty 6.0
A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.

Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
cs.CL 2024-01 unverdicted novelty 6.0
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
cs.CV 2023-06 unverdicted novelty 6.0
LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
cs.CV 2023-05 conditional novelty 6.0
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.

Towards Expert-Level Medical Question Answering with Large Language Models
cs.CL 2023-05 unverdicted novelty 6.0
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
cs.AI 2026-05 unverdicted novelty 5.0
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 5.0
Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0....

Prompting language influences diagnostic reasoning and accuracy of large language models
cs.CL 2026-05 unverdicted novelty 5.0
Four of five tested LLMs showed better diagnostic reasoning and accuracy when prompted in English than in French on physician-scored clinical vignettes.

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?
cs.CL 2026-05 unverdicted novelty 5.0
LLM agents in power-asymmetric role-play exhibit socio-cognitive effects including linguistic coordination, pronoun usage patterns, persuasion success, and compliance with unsafe requests.

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?
cs.CL 2026-05 unverdicted novelty 5.0
LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.

Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
cs.CL 2026-04 conditional novelty 5.0
Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.

VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 5.0
VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.

EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning
cs.CL 2026-04 unverdicted novelty 5.0
EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.

Medical Reasoning with Large Language Models: A Survey and MR-Bench
cs.CL 2026-03 accept novelty 5.0
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
cs.IR 2026-01 unverdicted novelty 5.0
VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.

RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
cs.HC 2025-02 unverdicted novelty 5.0
RECOVER is an LLM-powered RPM system for postoperative GI cancer care, built from 7 participatory design sessions and 5 patient interviews, then piloted with 4 staff and 5 patients to derive design strategies and resp...

GPT-4o System Card
cs.CL 2024-10 unverdicted novelty 5.0
GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.

Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
cs.CL 2026-05 unverdicted novelty 4.0
A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.

AI Identification: An Integrated Framework for Sustainable Governance in Digital Enterprises
cs.CR 2026-04 unverdicted novelty 4.0
The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.

A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
cs.CL 2026-04 unverdicted novelty 4.0
Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

Comparative Analysis of Large Language Models in Healthcare
cs.CL 2026-04 unverdicted novelty 3.0
Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.

Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation
cs.CL 2025-02 unverdicted novelty 3.0
Fine-tuning and data augmentation improve LLM performance on medical jargon extraction and prioritization from EHR notes, with augmented open-source models sometimes outperforming closed-source ones on 106 annotated notes.

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG 2024-01 unverdicted novelty 3.0
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

Entry-level guide to the use of large language models for medical research
cs.AI 2024-10 unverdicted novelty 2.0
A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 38 Pith papers · 9 internal anchors

[1]

Guidelines for human-AI interaction

[AWV+19] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI conference on Human Factors in Computing Systems, pages 1–13,

work page 2019
[2]

Lan- guage models are few-shot learners

[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural information processing systems , 33:1877–1901,

work page 1901
[3]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Automated identiﬁcation of adults at risk for in-hospital clinical deterioration

[ELS+20] Gabriel J Escobar, Vincent X Liu, Alejandro Schuler, Brian Lawson, John D Greene, and Patricia Kipnis. Automated identiﬁcation of adults at risk for in-hospital clinical deterioration. New England Journal of Medicine , 383(20):1951–1960,

work page 1951
[5]

Who goes ﬁrst? Inﬂuences of human-ai workﬂow on decision making in clinical imaging

[FCL+22] Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Paul Fisher, Diane Wilson, Michael Fitzke, Mark Parkinson, Eric Horvitz, Kori Inkpen, and Besmira Nushi. Who goes ﬁrst? Inﬂuences of human-ai workﬂow on decision making in clinical imaging. In 2022 ACM Conference on Fairness, Accountability, and Transparency , pages 1362–1374,

work page 2022
[6]

Measuring Massive Multitask Language Understanding

24 [HBB+20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[7]

Addressing bias in machine learning algorithms: A pilot study on emotion recognition for intelligent systems

[HZH17] Ayanna Howard, Cha Zhang, and Eric Horvitz. Addressing bias in machine learning algorithms: A pilot study on emotion recognition for intelligent systems. In 2017 IEEE Workshop on Advanced Robotics and its Social Impacts , pages 1–7. IEEE,

work page 2017
[8]

PubMedQA: A Dataset for Biomedical Research Question Answering

[JDL+19] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146,

work page internal anchor Pith review arXiv 1909
[9]

Measurement and fairness

[JW21] Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages 375–385,

work page 2021
[10]

Large Language Models are Zero-Shot Reasoners

[KGR+22] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 ,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Scaling Laws for Neural Language Models

[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeﬀrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 ,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[12]

Fairlearn: Conﬁgurable and interpretable algorithmic fairness

25 [KS21] Ankit Kulshrestha and Ilya Safro. Fairlearn: Conﬁgurable and interpretable algorithmic fairness. arXiv preprint arXiv:2111.08878 ,

work page arXiv
[13]

arXiv preprint arXiv:2207.08143 , year=

[LHW22] Valentin Li´ evin, Christoﬀer Egeberg Hother, and Ole Winther. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143 ,

work page arXiv
[14]

Reading between the lines: Modeling user behavior and costs in ai-assisted programming

[MBFH22] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. Reading between the lines: Modeling user behavior and costs in ai-assisted programming. arXiv preprint arXiv:2210.14306,

work page arXiv
[15]

[MSG+20] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashraﬁan, Trevor Back, Mary Chesus, Greg S Corrado, Ara Darzi, et al

[Online; accessed 18-March-2023]. [MSG+20] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashraﬁan, Trevor Back, Mary Chesus, Greg S Corrado, Ara Darzi, et al. International evaluation of an AI system for breast cancer screening.Nature, 577(7788):89– 94,

work page 2023
[16]

WebGPT: Browser-assisted question-answering with human feedback

[Online; accessed 18-March-2023]. [NHB+21] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeﬀ Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

[17]

[NJKC19] Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. InterpretML: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223, 2019.

[18]

[RIZ+17] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.

[19]

[SAT+22] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.

[20]

[WHK20] Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. arXiv preprint arXiv:2005.00582, 2020.

[21]

[WWS+22a] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[22]

[WWS+22b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.

[23]

• USMLE Sample Exam: Sample exam materials were sourced from USMLE practice materials at https://www.usmle.org/prepare-your-exam. Exam materials are contained in the following PDFs. Step 1: https://www.usmle.org/sites/default/files/2021-10/Step_1_Sample_Items.pdf. Step 2: https://www.usmle.org/sites/default/files/2021-10/Step2_CK_Sample_Questions.pdf...

