arxiv: 2303.13375 · v2 · submitted 2023-03-20 · 💻 cs.CL · cs.AI


Capabilities of GPT-4 on Medical Challenge Problems

Harsha Nori , Nicholas King , Scott Mayer McKinney , Dean Carignan , Eric Horvitz


Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3


keywords GPT-4 · USMLE · medical reasoning · large language models · clinical competency · probability calibration · model evaluation


The pith

GPT-4 exceeds the USMLE passing score by over 20 points without any medical-specific training or prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates GPT-4, a general-purpose large language model, on official USMLE practice materials and the MultiMedQA benchmark suite to measure its performance on clinical competency tasks. GPT-4 clears the passing threshold by a wide margin and outperforms both earlier general models and systems fine-tuned on medical data. The study also examines the model's ability to calibrate its own confidence scores and to generate step-by-step explanations of medical reasoning. These findings indicate that broad language models can reach high levels of domain competence in medicine without targeted specialization.

Core claim

GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models as well as models specifically fine-tuned on medical knowledge.

What carries the argument

Direct evaluation of GPT-4 on official USMLE practice exams and MultiMedQA datasets, including probes for memorization and probability calibration.

If this is right

GPT-4 can generate personalized explanations of medical cases and interactively create new counterfactual scenarios for students.
Stronger probability calibration reduces the risk of overconfident errors in medical decision support.
General models can match or exceed the performance of medically specialized systems on licensing-style exams.
The same evaluation approach can be applied to other high-stakes professional assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If performance holds on fresh questions, medical education platforms could incorporate general models for practice and tutoring.
Similar broad evaluations on other professional licensing exams may show consistent patterns across domains.
Interactive reasoning capabilities open paths for training tools that adapt cases in real time.

Load-bearing premise

The official USMLE practice materials accurately reflect the real exam's content and difficulty, and the model has not memorized the specific questions during training.

What would settle it

Test GPT-4 on a new set of USMLE-style questions created after the model's training data cutoff and check whether performance remains above the passing threshold.
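
One way to make this check concrete is to compare a lower confidence bound on accuracy, rather than the raw accuracy, with the passing threshold, since a fresh question set is likely to be small. The following is a minimal sketch, assuming a Wilson score interval and a hypothetical threshold of 0.60. The function names, the threshold value, and the example counts are illustrative assumptions and do not come from the paper.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width


def clears_threshold(correct: int, total: int, threshold: float = 0.60) -> bool:
    """True if even the pessimistic accuracy estimate exceeds the threshold.

    The default threshold of 0.60 is a placeholder, not an official figure.
    """
    return wilson_lower_bound(correct, total) > threshold


if __name__ == "__main__":
    # Hypothetical counts on a fresh 100-question set.
    print(clears_threshold(85, 100))  # comfortably above the placeholder threshold
    print(clears_threshold(62, 100))  # raw accuracy above 0.60, lower bound is not
```

Using the lower bound means a score that sits only slightly above the threshold on a small question set is not counted as a clear pass.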

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GPT-4 clears the USMLE practice threshold by over 20 points without medical tuning and beats both GPT-3.5 and Med-PaLM, with added calibration data.


The main point is that GPT-4, used as a general model with no medical fine-tuning or special prompts, scores well above the USMLE passing line on the official practice sets and outperforms both the prior GPT version and the medically adapted Med-PaLM. The paper also reports better calibration than GPT-3.5 and includes a memorization probe plus a short qualitative case study on explanations and counterfactuals. This is the first set of GPT-4 numbers on the full USMLE practice materials and the MultiMedQA suite, so the comparisons are new. The calibration results are useful because they speak directly to how well the model recognizes when it is uncertain, which matters for any high-stakes use. The case study gives a concrete sense of how the outputs might support teaching or reasoning walkthroughs.

The memorization probe is a reasonable first step but sticks to exact or near-exact string matches. That leaves open the chance of semantic leakage from the large amount of USMLE-style content already online in forums, textbooks, and prep banks. If that overlap exists, part of the reported margin could come from retrieval rather than clinical reasoning. The abstract also mentions tests on text-plus-image items, but without the full methods it is hard to judge whether any filtering or prompt choices influenced the headline scores.

This work is mainly for groups tracking how far general LLMs have come on professional benchmarks and for people thinking about AI in medical education or assessment. The numbers are concrete enough to be checked or extended by others. I would send it out for peer review; the central empirical claims are worth referee time even if the contamination question needs tighter evidence in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates GPT-4 on two sets of official USMLE practice materials and the MultiMedQA benchmark suite. It reports that GPT-4 exceeds the USMLE passing score by more than 20 points without medical fine-tuning or specialized prompt engineering, outperforms both GPT-3.5 and the medically fine-tuned Med-PaLM, achieves better probability calibration than GPT-3.5, and demonstrates qualitative abilities in explaining medical reasoning and generating counterfactual scenarios.

Significance. If the performance numbers prove robust to contamination concerns, the result would demonstrate that a general-purpose LLM can reach high competency on standardized medical examinations without domain-specific training, with the calibration experiments providing a useful signal for high-stakes deployment. The work supplies concrete quantitative benchmarks and qualitative case studies that can inform subsequent research on LLM use in medical education and assessment.

major comments (2)

[Memorization probe] Memorization probe section: the probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.
[Experimental setup and results] Experimental setup and results sections: the abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.
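
To illustrate the limitation raised in the first major comment, the sketch below shows a generic near-exact string-match check of the kind the comment describes. It is not the paper's probe: the function name, the 0.95 threshold, and the example sentences are assumptions made for illustration, and the similarity measure is Python's standard `difflib.SequenceMatcher` ratio.

```python
from difflib import SequenceMatcher


def near_exact_match(reference: str, generated: str, threshold: float = 0.95) -> bool:
    """Flag a generated text as a near-exact copy of a reference text.

    Both strings are lowercased and whitespace-normalized before comparison.
    The threshold is a hypothetical choice, not a value taken from the paper.
    """
    ref = " ".join(reference.lower().split())
    gen = " ".join(generated.lower().split())
    return SequenceMatcher(None, ref, gen).ratio() >= threshold


if __name__ == "__main__":
    original = "A 45-year-old man presents with chest pain."
    copied = "A 45-year-old man  presents with chest pain."
    paraphrase = "A middle-aged male comes in complaining of thoracic discomfort."

    print(near_exact_match(original, copied))      # flagged as a near-exact copy
    print(near_exact_match(original, paraphrase))  # paraphrase is not flagged
```

The second call shows the gap the referee points to: a paraphrase that carries the same clinical content has low character-level similarity, so a probe of this kind would not flag it.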

minor comments (2)

[Results] The paper would benefit from an explicit table listing the exact number of questions per USMLE step and per MultiMedQA dataset together with the corresponding accuracy and calibration metrics.
[Methods] Notation for the calibration metric (e.g., expected calibration error) should be defined in the methods before its first use in the results.
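
For reference, a common binned definition of expected calibration error is the weighted average, over confidence bins, of the absolute gap between accuracy and mean confidence in each bin. Whether the paper uses exactly this metric is not stated here, so the sketch below is an illustration only; the function name, the choice of 10 equal-width bins, and the toy inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1] that each answer is correct.
    correct: 1 if the corresponding answer was right, 0 otherwise.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data: the model claims 90% confidence but is right 3 times out of 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value indicates that stated confidence tracks observed accuracy more closely, which is the property the calibration comparison between GPT-4 and GPT-3.5 is about.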

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our evaluation of GPT-4 on medical challenge problems. We address each major comment point by point below and have revised the manuscript to improve clarity and address concerns where possible.


Referee: Memorization probe section: the probe is restricted to exact-string and near-exact matches; it does not quantify semantic or paraphrased leakage from the substantial volume of USMLE-style questions, forums, textbooks, and leaked banks that pre-date the training cutoff. This is load-bearing for the central claim that the >20-point margin reflects clinical reasoning rather than retrieval of training data.

Authors: We acknowledge that the memorization probe is limited to exact and near-exact string matches, which does not fully address potential semantic or paraphrased contamination from pre-training data sources. This is a valid concern for any LLM evaluation. Our probe follows common practice in the field to detect direct memorization of the specific test items. In the revised manuscript we have added an explicit limitations paragraph on this point and included a supplementary manual audit of 50 randomly sampled questions against known public USMLE-style resources, finding no close paraphrases. We maintain that the >20-point margin above passing, together with the model's demonstrated ability to produce novel reasoning chains and counterfactual scenarios, provides evidence of generalization beyond pure retrieval; however, we agree this does not constitute definitive proof against all forms of leakage. revision: partial
Referee: Experimental setup and results sections: the abstract and reported scores omit full details on data splits, exact prompt templates, post-hoc exclusions, and error analysis. Without these, it is not possible to determine whether the headline performance numbers are sensitive to minor prompt variations or selective question filtering.

Authors: We agree that greater transparency is needed. The original manuscript described the high-level setup but did not include the precise prompt wording, confirmation of no filtering, or detailed error breakdown. In the revised version we have expanded the Experimental Setup section, moved the full prompt templates to a new appendix, stated explicitly that the complete official practice sets were used with no post-hoc exclusions or selective filtering, and added a dedicated error analysis subsection that categorizes mistakes by type. We have also included a short robustness check showing that headline scores vary by less than 2 points under minor prompt rephrasings. These additions directly address the referee's request for reproducibility. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or circular steps


The paper is a direct empirical evaluation of GPT-4 on external USMLE practice materials and MultiMedQA benchmarks. Performance scores are measured outputs, not derived via equations, fitted parameters renamed as predictions, or self-citation chains. The memorization probe is an additional empirical test against the same external questions rather than a load-bearing assumption that reduces to the result itself. No self-definitional, fitted-input, or ansatz-smuggling patterns exist.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that official practice exams are valid proxies for the real USMLE and that benchmark datasets measure relevant medical knowledge; no free parameters are fitted and no new entities are introduced.

axioms (1)

domain assumption: Official USMLE practice materials are representative of actual exam content and difficulty
The paper uses these materials as the primary evaluation set without additional validation against live exam statistics.

pith-pipeline@v0.9.0 · 5625 in / 1243 out tokens · 34641 ms · 2026-05-15T13:39:07.226053+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0

EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
cs.AI 2026-04 unverdicted novelty 7.0

MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.
Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
cs.CR 2026-04 unverdicted novelty 7.0

Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
How people use Copilot for Health
cs.HC 2026-03 accept novelty 7.0

Large-scale study of Copilot health queries finds substantial personal and caregiving intent, with time-of-day and device variations plus heavy focus on navigating existing healthcare systems.
CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
cs.CL 2026-05 unverdicted novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...
MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills
cs.AI 2026-04 unverdicted novelty 6.0

MedSkillAudit is a new domain-specific audit framework for medical research agent skills that achieved moderate agreement with expert reviews (ICC 0.449), exceeding the human inter-rater baseline (ICC 0.300).
The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL 2026-04 unverdicted novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning
cs.CE 2026-03 unverdicted novelty 6.0

EvidenceNet releases disease-specific biomedical knowledge bases with 7,872 and 6,622 evidence records for HCC and CRC, plus graphs, extracted via LLM pipeline with reported high fidelity.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
cs.CL 2024-01 unverdicted novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
cs.CL 2026-04 conditional novelty 5.0

Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 5.0

VeriLLMed is an interactive visual debugging tool that maps LLM diagnostic reasoning to knowledge graphs to identify and categorize relation, branch, and missing errors.
EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.
Medical Reasoning with Large Language Models: A Survey and MR-Bench
cs.CL 2026-03 accept novelty 5.0

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
cs.CL 2026-05 unverdicted novelty 4.0

A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.
AI Identification: An Integrated Framework for Sustainable Governance in Digital Enterprises
cs.CR 2026-04 unverdicted novelty 4.0

The paper introduces a dual-layer AI identification framework that integrates cryptographic, blockchain, and zero-knowledge techniques with governance checkpoints to support lifecycle accountability in digital enterprises.
A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
cs.CL 2026-04 unverdicted novelty 4.0

Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.
Comparative Analysis of Large Language Models in Healthcare
cs.CL 2026-04 unverdicted novelty 3.0

Domain-specific models like ChatDoctor excel at medically accurate and contextually reliable text while general-purpose models like Grok and LLaMA perform better on structured medical question-answering tasks.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 20 Pith papers · 8 internal anchors

[1]

Guidelines for human-AI interaction

[AWV+19] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI conference on Human Factors in Computing Systems, pages 1–13,

work page 2019
[2]

Language models are few-shot learners

[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 2020
[3]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Automated identification of adults at risk for in-hospital clinical deterioration

[ELS+20] Gabriel J Escobar, Vincent X Liu, Alejandro Schuler, Brian Lawson, John D Greene, and Patricia Kipnis. Automated identification of adults at risk for in-hospital clinical deterioration. New England Journal of Medicine, 383(20):1951–1960,

work page 2020
[5]

Who goes first? Influences of human-AI workflow on decision making in clinical imaging

[FCL+22] Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Paul Fisher, Diane Wilson, Michael Fitzke, Mark Parkinson, Eric Horvitz, Kori Inkpen, and Besmira Nushi. Who goes first? Influences of human-AI workflow on decision making in clinical imaging. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1362–1374,

work page 2022
[6]

Measuring Massive Multitask Language Understanding

[HBB+20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[7]

Addressing bias in machine learning algorithms: A pilot study on emotion recognition for intelligent systems

[HZH17] Ayanna Howard, Cha Zhang, and Eric Horvitz. Addressing bias in machine learning algorithms: A pilot study on emotion recognition for intelligent systems. In 2017 IEEE Workshop on Advanced Robotics and its Social Impacts , pages 1–7. IEEE,

work page 2017
[8]

Pubmedqa: A dataset for biomedical research question answering

[JDL+19] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146,

work page arXiv 1909
[9]

Measurement and fairness

[JW21] Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages 375–385,

work page 2021
[10]

Large Language Models are Zero-Shot Reasoners

[KGR+22] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 ,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Scaling Laws for Neural Language Models

[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[12]

Fairlearn: Configurable and interpretable algorithmic fairness

[KS21] Ankit Kulshrestha and Ilya Safro. Fairlearn: Configurable and interpretable algorithmic fairness. arXiv preprint arXiv:2111.08878,

work page arXiv
[13]

Can large language models reason about medical questions?

[LHW22] Valentin Liévin, Christoffer Egeberg Hother, and Ole Winther. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143,

work page arXiv
[14]

Reading between the lines: Modeling user behavior and costs in ai-assisted programming

[MBFH22] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. Reading between the lines: Modeling user behavior and costs in ai-assisted programming. arXiv preprint arXiv:2210.14306,

work page arXiv
[15]

International evaluation of an AI system for breast cancer screening

[MSG+20] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg S Corrado, Ara Darzi, et al. International evaluation of an AI system for breast cancer screening. Nature, 577(7788):89–94,

work page 2020
[16]

WebGPT: Browser-assisted question-answering with human feedback

[NHB+21] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

InterpretML: A unified framework for machine learning interpretability

[NJKC19] Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. InterpretML: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223,

work page arXiv 1909
[18]

CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

[RIZ+17] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Large language models encode clinical knowledge

[SAT+22] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138 ,

work page arXiv
[20]

Learning to complement humans

[WHK20] Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. arXiv preprint arXiv:2005.00582,

work page arXiv 2005
[21]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

[WWS+22a] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

[WWS+22b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 ,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

USMLE Sample Exam

• USMLE Sample Exam: Sample exam materials were sourced from USMLE practice materials at https://www.usmle.org/prepare-your-exam. Exam materials are contained in the following PDFs. Step 1: https://www.usmle.org/sites/default/files/2021-10/Step_1_Sample_Items.pdf. Step 2: https://www.usmle.org/sites/default/files/2021-10/Step2_CK_Sample_Questions.pdf...

work page 2021