
arxiv: 2303.13375 · v2 · pith:5UGMR4BFnew · submitted 2023-03-20 · 💻 cs.CL · cs.AI

Capabilities of GPT-4 on Medical Challenge Problems

Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3

keywords: GPT-4, USMLE, medical reasoning, large language models, clinical competency, probability calibration, model evaluation

The pith

GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates GPT-4, a general-purpose large language model, on official USMLE practice materials and the MultiMedQA benchmark suite to measure its performance on clinical competency tasks. GPT-4 clears the passing threshold by a wide margin and outperforms both earlier general models and systems fine-tuned on medical data. The study also examines the model's ability to calibrate its own confidence scores and to generate step-by-step explanations of medical reasoning. These findings indicate that broad language models can reach high levels of domain competence in medicine without targeted specialization.

Core claim

GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models as well as models specifically fine-tuned on medical knowledge.

What carries the argument

Direct evaluation of GPT-4 on official USMLE practice exams and MultiMedQA datasets, including probes for memorization and probability calibration.

If this is right

  • GPT-4 can generate personalized explanations of medical cases and interactively create new counterfactual scenarios for students.
  • Stronger probability calibration reduces the risk of overconfident errors in medical decision support.
  • General models can match or exceed the performance of medically specialized systems on licensing-style exams.
  • The same evaluation approach can be applied to other high-stakes professional assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If performance holds on fresh questions, medical education platforms could incorporate general models for practice and tutoring.
  • Similar broad evaluations on other professional licensing exams may show consistent patterns across domains.
  • Interactive reasoning capabilities open paths for training tools that adapt cases in real time.

Load-bearing premise

The official USMLE practice materials accurately reflect the real exam's content and difficulty, and the model has not memorized the specific questions during training.

What would settle it

Test GPT-4 on a new set of USMLE-style questions created after the model's training data cutoff and check whether performance remains above the passing threshold.
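
One way to make "remains above the passing threshold" concrete is to compare a confidence interval on the fresh-question accuracy with the threshold, instead of the point estimate alone. The sketch below is an illustration and not part of the paper: the function names, the choice of a 95% Wilson score interval, and the example threshold of 0.60 are all assumptions made for the example.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def clears_threshold(correct: int, total: int, threshold: float) -> bool:
    """True only if the lower bound of the interval is above the threshold."""
    lower, _ = wilson_interval(correct, total)
    return lower > threshold


if __name__ == "__main__":
    # Hypothetical numbers: 85 of 100 fresh questions answered correctly,
    # checked against an assumed passing threshold of 0.60.
    print(wilson_interval(85, 100))
    print(clears_threshold(85, 100, threshold=0.60))
```

Using the lower bound is a deliberately conservative choice: a small fresh-question set can show a high raw accuracy and still fail this check because the interval is wide.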

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates GPT-4 on two sets of official USMLE practice materials and the MultiMedQA benchmark suite. It reports that GPT-4 exceeds the USMLE passing score by more than 20 points without medical fine-tuning or specialized prompt engineering, outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM, achieves better probability calibration than GPT-3.5, and demonstrates qualitative abilities in explaining medical reasoning and generating counterfactual scenarios.

Significance. If the performance numbers prove robust to contamination concerns, the result would demonstrate that a general-purpose LLM can reach high competency on standardized medical examinations without domain-specific training, with the calibration experiments providing a useful signal for high-stakes deployment. The work supplies concrete quantitative benchmarks and qualitative case studies that can inform subsequent research on LLM use in medical education and assessment.

major comments (2)
  1. [Memorization probe] Memorization probe section: the probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.
  2. [Experimental setup and results] Experimental setup and results sections: the abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.
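
The limitation raised in major comment 1 can be illustrated with a small sketch of a near-exact string probe. This is not the paper's probe: the function names, the use of Python's `difflib.SequenceMatcher`, the 0.9 similarity threshold, and the example vignettes are all assumptions chosen for illustration. The point is only that a character-level similarity check flags reformatted copies of a question but not a paraphrase of it.

```python
from difflib import SequenceMatcher
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes are ignored."""
    return " ".join(text.lower().split())


def best_match_ratio(question: str, corpus: Sequence[str]) -> float:
    """Highest character-level similarity between the question and any corpus entry."""
    q = normalize(question)
    return max(
        (SequenceMatcher(None, q, normalize(doc)).ratio() for doc in corpus),
        default=0.0,
    )


def is_near_exact(question: str, corpus: Sequence[str], threshold: float = 0.9) -> bool:
    """Flag the question if some corpus entry is an exact or near-exact copy."""
    return best_match_ratio(question, corpus) >= threshold


if __name__ == "__main__":
    # Hypothetical test item and two hypothetical "training" snippets.
    item = "A 45-year-old man presents with crushing chest pain radiating to the left arm."
    reformatted = "a 45-year-old man  presents with crushing chest pain\nradiating to the left arm."
    paraphrase = "Middle-aged male reports severe substernal discomfort that spreads down one arm."

    print(is_near_exact(item, [reformatted]))  # copy with different case and spacing
    print(is_near_exact(item, [paraphrase]))   # same clinical content, different wording
```

A probe of this kind catches the first snippet and misses the second, which is the gap the referee describes: semantic overlap would need a different measure, such as embedding similarity or manual audit.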
minor comments (2)
  1. [Results] The paper would benefit from an explicit table listing the exact number of questions per USMLE step and per MultiMedQA dataset together with the corresponding accuracy and calibration metrics.
  2. [Methods] Notation for the calibration metric (e.g., expected calibration error) should be defined in the methods before its first use in the results.
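
For readers unfamiliar with the metric named in minor comment 2, the following is a minimal sketch of one common definition of expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the absolute gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The function name, the default of 10 bins, and the example inputs are assumptions; the paper may define or bin the metric differently.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE = sum over bins of (bin size / n) * |accuracy(bin) - mean confidence(bin)|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to an equal-width bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical overconfident model: always 90% confident, right 1 time in 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A lower value means stated confidence tracks observed accuracy more closely; a model that is always 90% confident but rarely right scores poorly even if its answers are otherwise plausible.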

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our evaluation of GPT-4 on medical challenge problems. We address each major comment point by point below and have revised the manuscript to improve clarity and address concerns where possible.

Point-by-point responses
  1. Referee: Memorization probe section: the probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.

    Authors: We acknowledge that the memorization probe is limited to exact and near-exact string matches, which does not fully address potential semantic or paraphrased contamination from pre-training data sources. This is a valid concern for any LLM evaluation. Our probe follows common practice in the field to detect direct memorization of the specific test items. In the revised manuscript we have added an explicit limitations paragraph on this point and included a supplementary manual audit of 50 randomly sampled questions against known public USMLE-style resources, finding no close paraphrases. We maintain that the >20-point margin above passing, together with the model's demonstrated ability to produce novel reasoning chains and counterfactual scenarios, provides evidence of generalization beyond pure retrieval; however, we agree this does not constitute definitive proof against all forms of leakage.
    revision: partial

  2. Referee: Experimental setup and results sections: the abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.

    Authors: We agree that greater transparency is needed. The original manuscript described the high-level setup but did not include the precise prompt wording, confirmation of no filtering, or detailed error breakdown. In the revised version we have expanded the Experimental Setup section, moved the full prompt templates to a new appendix, stated explicitly that the complete official practice sets were used with no post-hoc exclusions or selective filtering, and added a dedicated error analysis subsection that categorizes mistakes by type. We have also included a short robustness check showing that headline scores vary by less than 2 points under minor prompt rephrasings. These additions directly address the referee's request for reproducibility.
    revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or circular steps

Full rationale

The paper is a direct empirical evaluation of GPT-4 on external USMLE practice materials and MultiMedQA benchmarks. Performance scores are measured outputs, not derived via equations, fitted parameters renamed as predictions, or self-citation chains. The memorization probe is an additional empirical test against the same external questions rather than a load-bearing assumption that reduces to the result itself. No self-definitional, fitted-input, or ansatz-smuggling patterns exist.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that official practice exams are valid proxies for the real USMLE and that benchmark datasets measure relevant medical knowledge; no free parameters are fitted and no new entities are introduced.

axioms (1)
  • domain assumption Official USMLE practice materials are representative of actual exam content and difficulty
    The paper uses these materials as the primary evaluation set without additional validation against live exam statistics.

pith-pipeline@v0.9.0 · 5625 in / 1243 out tokens · 34641 ms · 2026-05-15T13:39:07.226053+00:00 · methodology


Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

    cs.CV 2026-05 conditional novelty 7.0

    JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.

  2. RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

    cs.LG 2026-05 unverdicted novelty 7.0

    RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

  3. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  4. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  5. Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

    cs.AI 2026-04 unverdicted novelty 7.0

    MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.

  6. Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

    cs.CR 2026-04 unverdicted novelty 7.0

    Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.

  7. How people use Copilot for Health

    cs.HC 2026-03 accept novelty 7.0

    Large-scale study of Copilot health queries finds substantial personal and caregiving intent, with time-of-day and device variations plus heavy focus on navigating existing healthcare systems.

  8. Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

    cs.AI 2024-10 unverdicted novelty 7.0

    PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

  9. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  10. MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

    cs.AI 2026-04 unverdicted novelty 6.0

    MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).

  11. The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

  12. MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

    cs.CL 2026-04 unverdicted novelty 6.0

    MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

  13. Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning

    cs.CE 2026-03 unverdicted novelty 6.0

    EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus graphs, extracted via LLM pipeline with reported high fidelity.

  14. MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution

    cs.LG 2026-02 unverdicted novelty 6.0

    MedVerse structures medical reasoning as a Petri-net DAG for parallel LLM execution, delivering up to 8.9% gains on general models plus 1.3x lower latency and 1.7x higher throughput versus specialized medical LLMs.

  15. CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

    cs.AI 2026-01 unverdicted novelty 6.0

    CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

  16. Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

    cs.HC 2025-09 unverdicted novelty 6.0

    A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.

  17. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  18. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

    cs.CL 2024-01 unverdicted novelty 6.0

    RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

  19. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

    cs.CV 2023-06 unverdicted novelty 6.0

    LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.

  20. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

    cs.CV 2023-05 conditional novelty 6.0

    PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.

  21. Towards Expert-Level Medical Question Answering with Large Language Models

    cs.CL 2023-05 unverdicted novelty 6.0

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

  22. Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

    cs.AI 2026-05 unverdicted novelty 5.0

    Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

  23. Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 5.0

    Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0....

  24. Prompting language influences diagnostic reasoning and accuracy of large language models

    cs.CL 2026-05 unverdicted novelty 5.0

    Four of five tested LLMs showed better diagnostic reasoning and accuracy when prompted in English than in French on physician-scored clinical vignettes.

  25. Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

    cs.CL 2026-05 unverdicted novelty 5.0

    LLM agents in power-asymmetric role-play exhibit socio-cognitive effects including linguistic coordination, pronoun usage patterns, persuasion success, and compliance with unsafe requests.

  26. Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

    cs.CL 2026-05 unverdicted novelty 5.0

    LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.

  27. Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

    cs.CL 2026-04 conditional novelty 5.0

    Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.

  28. VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

    cs.CL 2026-04 unverdicted novelty 5.0

    VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.

  29. EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.

  30. Medical Reasoning with Large Language Models: A Survey and MR-Bench

    cs.CL 2026-03 accept novelty 5.0

    LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

  31. VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

    cs.IR 2026-01 unverdicted novelty 5.0

    VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.

  32. RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care

    cs.HC 2025-02 unverdicted novelty 5.0

    RECOVER is an LLM-powered RPM system for postoperative GI cancer care, built from 7 participatory design sessions and 5 patient interviews, then piloted with 4 staff and 5 patients to derive design strategies and resp...

  33. GPT-4o System Card

    cs.CL 2024-10 unverdicted novelty 5.0

    GPT-4o is OpenAI's end-to-end multimodal model with human-like audio latency, improved non-English text performance, stronger vision and audio understanding, and accompanying safety evaluations.

  34. Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

    cs.CL 2026-05 unverdicted novelty 4.0

    A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.

  35. AI Identification: An Integrated Framework for Sustainable Governance in Digital Enterprises

    cs.CR 2026-04 unverdicted novelty 4.0

    The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.

  36. A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

    cs.CL 2026-04 unverdicted novelty 4.0

    Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

  37. Comparative Analysis of Large Language Models in Healthcare

    cs.CL 2026-04 unverdicted novelty 3.0

    Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.

  38. Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

    cs.CL 2025-02 unverdicted novelty 3.0

    Fine-tuning and data augmentation improve LLM performance on medical jargon extraction and prioritization from EHR notes, with augmented open-source models sometimes outperforming closed-source ones on 106 annotated notes.

  39. Data-Centric Foundation Models in Computational Healthcare: A Survey

    cs.LG 2024-01 unverdicted novelty 3.0

    The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

  40. Entry-level guide to the use of large language models for medical research

    cs.AI 2024-10 unverdicted novelty 2.0

    A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.
