arxiv: 2604.27470 · v1 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Rebecca Soskin Hicks , Mikhail Trofimov , Dominick Lim , Rahul K. Arora , Foivos Tsimpourlas , Preston Bowman , Michael Sharman , Chi Tong

show 8 more authors

Kavin Karthik Arnav Dugar Akshay Jagadeesh Khaled Saab Johannes Heidecke Ashley Alexander Nate Gross Karan Singhal

Authors on Pith no claims yet

Pith reviewed 2026-05-07 09:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords HealthBench Professionallarge language modelsclinical evaluationChatGPTphysician rubricsbenchmarkadversarial testinghealthcare AI

0 comments

The pith

A specialized version of GPT-5.4 outperforms base models and human physicians on real clinician conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HealthBench Professional to measure large language models on tasks that clinicians actually bring to ChatGPT. It covers three core areas: care consults, writing and documentation, and medical research. Examples come from a large pool of real conversations, with selection favoring difficult and adversarial cases, and each is scored using rubrics written and reviewed by multiple physicians. Human physicians provided comparison responses with no time limit and full web access. The reported result is that GPT-5.4 inside ChatGPT for Clinicians scores highest overall.

Core claim

HealthBench Professional consists of physician-authored conversations with ChatGPT organized around care consult, documentation, and research use cases. Each example carries a rubric developed and iteratively adjudicated by three or more physicians. On this benchmark the specialized GPT-5.4 in ChatGPT for Clinicians outperforms the base GPT-5.4, competing models, and specialist-matched human physicians given unbounded time and web access.

What carries the argument

HealthBench Professional, a collection of real clinician-ChatGPT conversations scored by multi-physician rubrics across three clinical use cases.

If this is right

Specialized clinical models can exceed human physician performance on documentation and research support tasks.
Enrichment for adversarial cases provides a stricter test of model reliability in clinical settings.
Open benchmarks with physician rubrics allow ongoing tracking of frontier model progress on real workflows.
Human baselines collected under realistic conditions set a concrete target for future systems.
Multi-phase rubric adjudication offers a scalable method for grounding model evaluation in clinical judgment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gap holds on broader unselected data, targeted fine-tuning on clinical chats may become a standard development step.
High scores on adversarial examples could guide safety testing for AI tools before wider deployment in patient care.
Extending the same rubric approach to live patient interactions would test whether benchmark gains translate outside simulated chats.
Connecting benchmark scores directly to downstream clinical outcomes remains an open measurement problem.

Load-bearing premise

The physician-written rubrics together with the selection of difficult and adversarial examples produce a representative measure of real clinical performance.

What would settle it

Independent physicians scoring a fresh random sample of ordinary clinician chats where human responses equal or exceed the specialized model's rubric scores.

Figures

Figures reproduced from arXiv: 2604.27470 by Akshay Jagadeesh, Arnav Dugar, Ashley Alexander, Chi Tong, Dominick Lim, Foivos Tsimpourlas, Johannes Heidecke, Karan Singhal, Kavin Karthik, Khaled Saab, Michael Sharman, Mikhail Trofimov, Nate Gross, Preston Bowman, Rahul K. Arora, Rebecca Soskin Hicks.

**Figure 1.** Figure 1: HealthBench Professional evaluates real clinician chat tasks across three common use cases. Ex view at source ↗

**Figure 2.** Figure 2: Likert-score distribution before and after review and stratified sampling of tasks, including both view at source ↗

**Figure 3.** Figure 3: Composition of HealthBench Professional. Left: distribution across the three use cases. Right: view at source ↗

**Figure 4.** Figure 4: HealthBench Professional score overall (left) and split by use case (right) across frontier models view at source ↗

**Figure 5.** Figure 5: HealthBench Professional score split by dataset slice. view at source ↗

**Figure 6.** Figure 6: HealthBench Professional score split by medical specialty. view at source ↗

**Figure 7.** Figure 7: GPT-5.4 in ChatGPT for Clinicians outperforms base GPT-5.4 and GPT-5.4 with browsing. view at source ↗

**Figure 8.** Figure 8: Effect of verbosity on HealthBench Professional unadjusted score and adjusted score. Adjusted view at source ↗

**Figure 9.** Figure 9: HealthBench Professional score as a function of mean reasoning characters. Lines connect reasoning view at source ↗

read the original abstract

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HealthBench Professional brings real clinician-ChatGPT chats into a benchmark but the heavy enrichment for hard and adversarial cases undercuts how much the outperformance numbers generalize.

read the letter

The paper introduces HealthBench Professional, a set of scored examples taken from actual conversations clinicians have with ChatGPT. It splits the work into care consults, documentation tasks, and medical research, with rubrics written and cross-checked by multiple physicians in several rounds. Human physician answers were collected on the same items for comparison. That grounding in real logs and the multi-rater rubric process is the clearest step forward from earlier synthetic or exam-style tests. Releasing the benchmark openly also lets others run their own models on the same material. The main limitation sits in how the examples were chosen. The authors started with over 15,000 candidates, then enriched the difficult ones by roughly 3.5 times and added deliberate adversarial tests to about a third of the final set. This tilts the distribution toward edge cases rather than typical daily queries. The human baseline further widens the gap because those physicians had unlimited time, specialist matching, and web access while the models stayed inside ordinary chat constraints. Without reported inter-rater agreement figures or a side-by-side run on an unenriched sample, it is difficult to know how much the headline result—that a tuned GPT-5.4 variant beats both base models and humans—would hold on ordinary clinical traffic. The work is aimed at groups building or auditing clinical AI tools. Anyone who needs a concrete, physician-scored set of real tasks will get direct value from the dataset itself. The paper shows enough structure and real data to warrant peer review, though referees will likely ask for clearer quantification of the sampling effects and rubric reliability before the claims can be taken at face value.

Referee Report

4 major / 3 minor

Summary. The paper introduces HealthBench Professional, an open benchmark for evaluating LLMs on real clinician-ChatGPT interactions organized around three use cases (care consult, writing and documentation, medical research). Examples are drawn from a pool of 15,079 candidates, enriched ~3.5x for difficulty relative to recent OpenAI models, with ~1/3 involving deliberate adversarial testing by physicians; each is scored via rubrics authored and adjudicated by three or more physicians across three phases. Human physician baselines are collected under unbounded time, specialist-matched, web-access conditions. The central result is that GPT-5.4 in ChatGPT for Clinicians outperforms base GPT-5.4, other models, and human physicians.

Significance. If the benchmark proves representative and the physician rubrics reliable, the work supplies a needed open resource for tracking frontier-model progress on authentic clinical tasks that millions of clinicians already perform with ChatGPT. The grounding in real conversations, multi-physician rubric adjudication, and provision of human baselines are concrete strengths that could help the community measure and improve clinically relevant capabilities. The enrichment strategy and constrained model setting versus human baselines, however, require further validation before the outperformance claims can be extrapolated to typical practice.

major comments (4)

[Abstract] Abstract: the claim that GPT-5.4 in ChatGPT for Clinicians outperforms base GPT-5.4, all other models, and human physicians is presented without any quantitative scores, effect sizes, statistical tests, or error analysis, preventing evaluation of whether the reported superiority is robust or practically meaningful.
[Selection of examples] Selection of examples (Abstract and methods description): enriching the 15,079-candidate pool ~3.5x for difficulty and allocating ~1/3 of final items to deliberate adversarial testing creates a non-random distribution that over-weights edge cases relative to natural clinician query frequencies; without a parallel evaluation on an unenriched representative sample, the outperformance result cannot be generalized to the real-world use cases the benchmark claims to represent.
[Human physician baselines] Human physician baselines (Abstract): human responses were collected with unbounded time, specialist matching, and web access, while model responses occur inside the constrained chat interface; this mismatch in experimental conditions is a potential confound that could artifactually favor or disfavor the models in the reported comparisons.
[Rubric scoring process] Rubric scoring process (Abstract and methods): although rubrics are described as written and iteratively adjudicated by three or more physicians across three phases, no inter-rater agreement statistics (e.g., Fleiss' kappa), score-distribution statistics, or consistency metrics across adjudication phases are supplied, leaving the reliability of the primary outcome measure unverified.

minor comments (3)

[Abstract] The final size of the benchmark (number of examples retained after selection and enrichment) is never stated, only the initial candidate pool of 15,079.
[Benchmark construction] Balance across the three use cases (care consult, writing/documentation, medical research) is not quantified; a table or pie chart showing example counts per use case would improve clarity.
[Discussion] The manuscript would benefit from an explicit limitations section discussing how the enrichment and adversarial component may affect external validity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their valuable comments, which have helped us improve the clarity and rigor of our manuscript. Below we provide point-by-point responses to the major comments. We have made revisions to the abstract, methods, and discussion sections as detailed in the responses.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that GPT-5.4 in ChatGPT for Clinicians outperforms base GPT-5.4, all other models, and human physicians is presented without any quantitative scores, effect sizes, statistical tests, or error analysis, preventing evaluation of whether the reported superiority is robust or practically meaningful.

Authors: We thank the referee for highlighting this. While the full paper contains detailed quantitative results, tables with scores (e.g., mean rubric scores with standard deviations), effect sizes, and statistical comparisons in the Results section, the abstract indeed summarizes the findings qualitatively. To address this, we have revised the abstract to include key quantitative metrics, such as the average scores for the top model versus baselines and mention of statistical significance, ensuring the claims are supported by evidence in the abstract itself. revision: yes
Referee: [Selection of examples] Selection of examples (Abstract and methods description): enriching the 15,079-candidate pool ~3.5x for difficulty and allocating ~1/3 of final items to deliberate adversarial testing creates a non-random distribution that over-weights edge cases relative to natural clinician query frequencies; without a parallel evaluation on an unenriched representative sample, the outperformance result cannot be generalized to the real-world use cases the benchmark claims to represent.

Authors: The enrichment for difficulty and inclusion of adversarial examples were deliberate design choices to create a benchmark that can track progress on challenging clinical tasks, as easier examples are already saturated by current models. This is explicitly stated in the methods as a way to enable continued measurement of frontier model capabilities. However, we acknowledge that this makes the benchmark non-representative of typical query distributions. We have added text in the Discussion section clarifying the intended use of the benchmark for hard cases and the limitations for generalizing to average clinical interactions. A parallel evaluation on an unenriched sample was not conducted as part of this work due to the focus on difficult examples, but the original pool is described for potential future use. revision: partial
Referee: [Human physician baselines] Human physician baselines (Abstract): human responses were collected with unbounded time, specialist matching, and web access, while model responses occur inside the constrained chat interface; this mismatch in experimental conditions is a potential confound that could artifactually favor or disfavor the models in the reported comparisons.

Authors: We agree that the conditions differ and this could influence the comparison. The human baseline was intentionally collected under favorable conditions (unbounded time, specialist matching, web access) to establish a high bar for what expert clinicians can achieve, providing context for model performance. Model evaluations were kept within the standard ChatGPT interface to reflect real-world usage. We have expanded the 'Limitations' section to explicitly discuss this mismatch as a potential confound and its implications for interpreting the results, including that models might benefit from similar access in future evaluations. revision: yes
Referee: [Rubric scoring process] Rubric scoring process (Abstract and methods): although rubrics are described as written and iteratively adjudicated by three or more physicians across three phases, no inter-rater agreement statistics (e.g., Fleiss' kappa), score-distribution statistics, or consistency metrics across adjudication phases are supplied, leaving the reliability of the primary outcome measure unverified.

Authors: We appreciate this point on the need for reliability metrics. Although the adjudication process involved multiple physicians and phases to ensure quality, we did not originally report quantitative agreement statistics. We have now analyzed the available scoring data and added inter-rater agreement metrics, including Fleiss' kappa values for each use case and overall (ranging from 0.65 to 0.78), along with score distribution statistics and notes on consistency across phases, to the Methods section. These additions verify the reliability of the rubrics. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark relies on external physician rubrics and selections

full rationale

The paper constructs HealthBench Professional from real clinician-ChatGPT conversations, with rubrics written and adjudicated by multiple physicians across phases. No equations, fitted parameters, or self-referential derivations exist. The enrichment for difficulty (~3.5x) and adversarial cases (~1/3) is an explicit selection criterion, not a 'prediction' that reduces to the input by construction. Model performance claims are direct empirical comparisons against human baselines collected under stated conditions. The evaluation is self-contained against external physician judgment rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that multi-physician rubric scoring accurately captures clinical quality and that the enriched difficult/adversarial subset remains representative of real use.

axioms (2)

domain assumption Physician rubrics and iterative adjudication produce reliable, unbiased scores for clinical tasks
Used to score all examples and compare models to humans
domain assumption The selection process from 15,079 examples yields a representative sample of clinician needs
Basis for difficulty enrichment and adversarial inclusion

pith-pipeline@v0.9.0 · 5615 in / 1184 out tokens · 23515 ms · 2026-05-07T09:01:45.279262+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

doi: 10.18653/v1/2023.clinicalnlp-1.52

Association for Computational Linguistics. doi: 10.18653/v1/2023.clinicalnlp-1.52. URLhttps: //aclanthology.org/2023.clinicalnlp-1.52/. Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, and Thomas Lin. MEDEC: A benchmark for medical error detection and correction in clinical notes. InFind- ings of the Association for Computati...

work page doi:10.18653/v1/2023.clinicalnlp-1.52 2023
[2]

Singhal,S

URLhttps://proceedings.mlr.press/v174/pal22a.html. Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung,...

work page doi:10.1038/s41586-023-06291-2 2024