Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

Bryan Wilder; Fei Fang; Mateo Dulce Rubio; Naveen Raman; Santiago Cortes-Gomez

arxiv: 2605.22612 · v1 · pith:VQ7HHZG6new · submitted 2026-05-21 · 💻 cs.CY · cs.AI · cs.LG

Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

Pith reviewed 2026-05-22 03:36 UTC · model grok-4.3

classification: 💻 cs.CY · cs.AI · cs.LG

keywords: healthcare LLMs · evaluation-deployment gap · benchmark assumptions · staged evaluation · BenchmarkCards · RCT reanalysis · user behavior · task and outcome assumptions

The pith

The evaluation-deployment gap in healthcare LLMs stems from implicit assumptions about user behavior that benchmarks alone cannot reveal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that benchmarks for healthcare large language models do not predict real deployment performance mainly because they embed unstated assumptions about how clinicians and patients will actually interact with the outputs. These assumptions split into two types: task assumptions that can be checked against conversation transcripts alone, and outcome assumptions that depend on human decisions and require separate behavioral data or studies to test. Reanalyzing one healthcare randomized trial showed the overall gap split roughly evenly between the two types. The authors introduce BenchmarkCards to record the assumptions explicitly and staged evaluation to test them in sequence before full use. A reader should care because this framing shifts attention from fixing benchmark scores to making the hidden human factors visible and measurable.

Core claim

The evaluation-deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. Assumptions divide into task assumptions, testable from conversation data, and outcome assumptions, which require outcome data and behavioral studies. Retrospective analysis of a healthcare RCT shows the gap naturally separates into task and outcome gaps of roughly equal size. BenchmarkCards document the assumptions and staged evaluation systematically tests them.

What carries the argument

The two-category classification of assumptions into task (testable from conversation data alone) and outcome (requiring outcome data and behavioral studies), which separates the sources of the evaluation-deployment gap and enables BenchmarkCards and staged evaluation.

If this is right

BenchmarkCards would make both task and outcome assumptions explicit for any new healthcare LLM evaluation.
Staged evaluation would allow teams to measure and close the task gap first, then address the outcome gap through targeted studies.
The roughly equal split between task and outcome gaps observed in the RCT reanalysis would recur across other deployments if the framework holds.
Outcome assumptions would need direct testing with real user behavior data rather than proxy metrics from benchmarks.
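
The staged procedure described in the second point can be sketched as a two-stage gate. This is a minimal illustration, not the authors' procedure: the `Assumption` class, the `staged_evaluation` function, and the rule of stopping as soon as any task assumption fails are all assumptions made here for concreteness, and the example assumption names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assumption:
    name: str
    kind: str                   # "task" or "outcome"
    check: Callable[[], bool]   # hypothetical test supplied by the evaluator


def staged_evaluation(assumptions: List[Assumption]) -> Dict:
    """Test task assumptions first; test outcome assumptions only if all of them hold."""
    report = {"task": {}, "outcome": {}, "stage_reached": "task"}

    # Stage 1: task assumptions, checkable from conversation data alone.
    for a in assumptions:
        if a.kind == "task":
            report["task"][a.name] = a.check()
    if not all(report["task"].values()):
        return report  # stop early: close the task gap before running behavioral studies

    # Stage 2: outcome assumptions, which need outcome data or behavioral studies.
    report["stage_reached"] = "outcome"
    for a in assumptions:
        if a.kind == "outcome":
            report["outcome"][a.name] = a.check()
    if all(report["outcome"].values()):
        report["stage_reached"] = "deploy"
    return report


if __name__ == "__main__":
    # Hypothetical assumptions with stubbed checks, for illustration only.
    example = [
        Assumption("users phrase queries like benchmark prompts", "task", lambda: True),
        Assumption("clinicians act on the model's recommendation", "outcome", lambda: False),
    ]
    print(staged_evaluation(example))
```

Ordering the stages this way reflects the cost asymmetry the summary implies: conversation-log checks are cheap, so failing them first avoids spending a behavioral study on a deployment that is already off-benchmark.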

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams building healthcare LLMs might need to embed simple user-behavior simulations into early benchmark stages to anticipate outcome gaps.
The same task-versus-outcome split could be applied to evaluate LLMs in legal or financial settings where human interpretation also drives results.
Regulatory bodies could require BenchmarkCards as part of safety submissions to ensure outcome assumptions are stated before approval.
Future work could test whether closing the outcome gap requires changes to model interfaces rather than to the model itself.

Load-bearing premise

That outcome assumptions depending on human behavior can be systematically isolated and tested through staged evaluation and behavioral studies separate from benchmark data.

What would settle it

A reanalysis of several additional healthcare RCTs in which the outcome gap either cannot be isolated or accounts for far less than half the total performance drop would undermine the separation claim.

Figures

Figures reproduced from arXiv: 2605.22612 by Bryan Wilder, Fei Fang, Mateo Dulce Rubio, Naveen Raman, Santiago Cortes-Gomez.

Original abstract

Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation-deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper usefully splits assumptions in healthcare LLM benchmarks into task and outcome types with BenchmarkCards as a documentation tool, but the RCT reanalysis lacks explicit metrics for the claimed equal gap split.

The paper argues that healthcare LLM benchmarks fail to predict deployment success mainly due to hidden assumptions about how users will behave with the model outputs. It classifies these assumptions as either task-related, which can be evaluated using only the conversation data, or outcome-related, which depend on actual clinical results and studies of human interaction. What is new here is the explicit task versus outcome categorization and the BenchmarkCards as a documentation tool. The staged evaluation procedure is another practical addition.

The retrospective RCT case study illustrates the idea by showing the evaluation-deployment gap splitting into roughly equal task and outcome parts. This case study is a strength because it applies the framework to real data from a trial, giving the position some empirical backing rather than leaving it purely theoretical. The soft spot is the lack of explicit quantitative definitions in how the RCT data was partitioned into the two gap types. Without measurable criteria like specific intent accuracy for task gaps or downstream metric differences for outcome gaps, the equal size claim could be sensitive to different analysis choices. The new artifacts and procedures are described at a conceptual level without extensive testing.

This is for evaluators and developers of AI tools in healthcare who want to address why benchmarks don't translate well to practice. It provides a structured lens for thinking about these issues. The work shows honest engagement with the problem and the evidence, so it deserves serious peer review to tighten up the demonstration. I recommend sending it for review, with feedback focused on defining the metrics used in the case study more rigorously.

Referee Report

2 major / 2 minor

Summary. The paper argues that the evaluation-deployment gap for healthcare LLMs arises from implicit assumptions about user-model interactions that benchmarks alone cannot surface, rather than from inadequate benchmark design. It distinguishes task assumptions (testable via conversation data) from outcome assumptions (requiring outcome data and behavioral studies due to dependence on human behavior). A retrospective reanalysis of a healthcare RCT is used to show that this gap separates into task and outcome components of roughly equal magnitude. The authors propose BenchmarkCards as an artifact to document assumptions explicitly and a staged evaluation procedure to test them systematically before deployment.

Significance. If the framework and case-study separation hold, the work could encourage more assumption-transparent benchmarking practices in healthcare AI, helping practitioners anticipate real-world performance shortfalls that current benchmarks miss. The grounding in external RCT data rather than self-referential fitting is a positive feature, as is the attempt to operationalize the distinction between task and outcome gaps. However, the proposals for BenchmarkCards and staged evaluation receive only conceptual treatment, so the primary significance at present is in reframing the problem rather than in delivering immediately usable tools.

major comments (2)

[RCT case study / retrospective analysis] The central demonstration that the evaluation-deployment gap 'naturally separates into task and outcome gaps of roughly equal size' rests on the retrospective RCT case study. The manuscript does not supply explicit quantitative definitions or measurable quantities for partitioning the data: for example, it is unclear how task-gap size would be computed from conversation logs (e.g., intent-extraction accuracy) versus outcome-gap size (e.g., downstream clinical metric differences after human mediation). Absent such definitions, alternative attributions of the same observations could alter or eliminate the equal-magnitude finding, weakening the evidence that outcome assumptions require separate behavioral testing.
[Contributions / BenchmarkCards and staged evaluation] The descriptions of BenchmarkCards and the staged evaluation procedure remain high-level and lack concrete templates, worked examples, or pilot results. Because these artifacts are presented as the practical response to the identified gaps, the absence of even minimal operational detail makes it difficult to evaluate whether they can be implemented without introducing new untested assumptions.

minor comments (2)

[Introduction] Clarify early in the introduction whether 'task assumptions' and 'outcome assumptions' are intended as exhaustive categories or whether hybrid cases are acknowledged.
[Case study] Add a short table or figure summarizing the RCT reanalysis metrics (e.g., before/after gap sizes) to make the equal-magnitude claim easier to inspect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. The feedback identifies opportunities to strengthen the quantitative grounding of the RCT case study and to add operational detail to the proposed artifacts. We address each major comment below and will incorporate revisions accordingly.

Referee: [RCT case study / retrospective analysis] The central demonstration that the evaluation-deployment gap 'naturally separates into task and outcome gaps of roughly equal size' rests on the retrospective RCT case study. The manuscript does not supply explicit quantitative definitions or measurable quantities for partitioning the data: for example, it is unclear how task-gap size would be computed from conversation logs (e.g., intent-extraction accuracy) versus outcome-gap size (e.g., downstream clinical metric differences after human mediation). Absent such definitions, alternative attributions of the same observations could alter or eliminate the equal-magnitude finding, weakening the evidence that outcome assumptions require separate behavioral testing.

Authors: We agree that the current presentation would be strengthened by explicit quantitative definitions. In the revision we will define the task gap as the discrepancy between benchmark-predicted performance and observed metrics from RCT conversation logs (e.g., intent recognition accuracy or action prediction F1). The outcome gap will be defined as the residual difference in downstream clinical metrics after subtracting the task-level discrepancy, isolating effects attributable to unmodeled human behavior. We will also include a brief discussion of how alternative partitionings were considered and why the data support the reported separation. These additions will make the equal-magnitude claim more testable and address concerns about alternative attributions.
Revision: yes
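
The two definitions in this response amount to a simple residual decomposition, which can be sketched as follows. This is an illustration under a strong assumption that the response itself does not make: that benchmark, conversation-log, and deployment performance are all expressed on one common scale. The function name `decompose_gap` and the numbers in the usage example are hypothetical and are not taken from the RCT.

```python
from typing import Dict, Optional


def decompose_gap(
    benchmark_score: float,
    conversation_score: float,
    deployment_score: float,
) -> Dict[str, Optional[float]]:
    """Split the evaluation-deployment gap into task and outcome parts.

    Assumes, for illustration only, that all three scores share one scale.
    """
    total_gap = benchmark_score - deployment_score
    # Task gap: benchmark-predicted performance minus what conversation logs show.
    task_gap = benchmark_score - conversation_score
    # Outcome gap: the residual once the task-level discrepancy is removed.
    outcome_gap = total_gap - task_gap
    # Share of the total gap attributed to outcome assumptions (undefined if no gap).
    outcome_share = outcome_gap / total_gap if total_gap != 0 else None
    return {
        "total_gap": total_gap,
        "task_gap": task_gap,
        "outcome_gap": outcome_gap,
        "outcome_share": outcome_share,
    }


if __name__ == "__main__":
    # Made-up scores chosen so the split is even; not results from the paper.
    print(decompose_gap(benchmark_score=0.875, conversation_score=0.75, deployment_score=0.625))
```

Because the outcome gap is defined as a residual, any error in measuring the task gap shifts the outcome gap by the same amount, which is one way the alternative attributions raised by the referee could change the reported split.
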
Referee: [Contributions / BenchmarkCards and staged evaluation] The descriptions of BenchmarkCards and the staged evaluation procedure remain high-level and lack concrete templates, worked examples, or pilot results. Because these artifacts are presented as the practical response to the identified gaps, the absence of even minimal operational detail makes it difficult to evaluate whether they can be implemented without introducing new untested assumptions.

Authors: We acknowledge that the proposals are currently conceptual. In the revised manuscript we will supply a concrete BenchmarkCards template with fields for task assumptions, outcome assumptions, data sources for testing each, and an example populated using the RCT case. For staged evaluation we will add a worked example that walks through sequential testing of assumptions using conversation logs followed by outcome data. These additions will illustrate implementation steps while noting any assumptions that remain. Full empirical pilots lie beyond the scope of this position paper but can be pursued in follow-up work.
Revision: yes
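
One way to picture the template promised here is as a small record type. The sketch below is a guess at a plausible shape, not the authors' template: the class names `CardEntry` and `BenchmarkCard`, every field name, and the example entries are hypothetical. It only encodes what the response lists (task assumptions, outcome assumptions, and a data source for testing each) plus a per-deployment status.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CardEntry:
    assumption: str
    kind: str                # "task" or "outcome"
    data_source: str         # where the assumption could be tested
    holds_at_deployment: Optional[bool] = None  # filled per deployment; None = untested


@dataclass
class BenchmarkCard:
    benchmark: str
    entries: List[CardEntry] = field(default_factory=list)

    def by_kind(self, kind: str) -> List[CardEntry]:
        """Return the entries of one assumption type."""
        return [e for e in self.entries if e.kind == kind]

    def untested(self) -> List[CardEntry]:
        """Return entries that no deployment assessment has filled in yet."""
        return [e for e in self.entries if e.holds_at_deployment is None]


if __name__ == "__main__":
    # Hypothetical card contents, for illustration only.
    card = BenchmarkCard(
        benchmark="example-clinical-qa",
        entries=[
            CardEntry("queries resemble benchmark prompts", "task",
                      "conversation logs", holds_at_deployment=True),
            CardEntry("clinicians follow the model's suggestion", "outcome",
                      "behavioral study"),
        ],
    )
    for entry in card.untested():
        print(f"untested {entry.kind} assumption: {entry.assumption}")
```

Keeping the deployment status separate from the assumption text lets benchmark designers write the card once while each deployment team fills in only the status field.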

Circularity Check

0 steps flagged

Framework grounded in external RCT data with no self-referential reductions or fitted predictions

The paper's central claim—that the evaluation-deployment gap separates into task and outcome components of roughly equal size—is demonstrated via retrospective analysis of an independent healthcare RCT case study rather than any internal fitting, self-defined parameters, or load-bearing self-citations. No equations or derivations reduce the classification of assumptions or the proposed BenchmarkCards/staged evaluation procedure to inputs defined by the authors' prior work. The framework draws on external data for its empirical demonstration, making the derivation self-contained against external benchmarks and yielding only minor (non-load-bearing) circularity risk at most.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that assumptions can be cleanly partitioned into task and outcome types and that outcome assumptions require external behavioral data; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Assumptions underlying benchmarks can be partitioned into task assumptions observable from conversation data and outcome assumptions requiring behavioral studies.
This partition is the core of the proposed framework and is invoked to explain the evaluation-deployment gap.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 3 internal anchors

[1] Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Deborah Raji, and Travis Zack. Medical large language model benchmarks should prioritize construct validity. arXiv preprint arXiv:2503.10694.

[2] Alexandra Chouldechova, Chad Atalla, Solon Barocas, A Feder Cooper, Emily Corvi, P Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Stefanie Reed, Emily Sheng, et al. A shared standard for valid measurement of generative ai systems’ capabilities, risks, and impacts. arXiv preprint arXiv:2412.01934.

[3] Fatemeh Dehghani, Roya Dehghani, Yazdan Naderzadeh Ardebili, and Shahryar Rahnamayan. Large language models in legal systems: A survey. Humanities and Social Sciences Communications, 12(1):1977.

[4] Ruchira Dhar, Danae Sanchez Villegas, Antonia Karamolegkou, Alice Schiavone, Yifei Yuan, Xinyi Chen, Jiaang Li, Stella Frank, Laura De Grazia, Monorama Swain, et al. Evalcards: A framework for standardized evaluation reporting. arXiv preprint arXiv:2511.21695.

[5] Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. Evaluation gaps in machine learning practice. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency, pages 1859–1876.

[6] Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 375–385.

[7] Charlotte Li, Nick Hagar, Sachita Nishal, Jeremy Gilbert, and Nick Diakopoulos. Towards ecologically valid llm benchmarks: Understanding and designing domain-centered evaluations for journalism practitioners. arXiv preprint arXiv:2511.05501, 2025a. Listed title: Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners.

Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. In Proceeding...

[8] Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025b.

Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machi...

[9] David Manheim and Scott Garrabrant. Categorizing variants of goodhart’s law. arXiv preprint arXiv:1803.04585.

[10] Inioluwa Deborah Raji, Emily M Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. Ai and the everything in the whole wide world benchmark. arXiv preprint arXiv:2111.15366.

[11] Hanna Wallach, Meera Desai, Nicholas Pangakis, A Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, et al. Evaluating generative ai systems is a social science measurement challenge. arXiv preprint arXiv:2411.10939.

[12] Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, et al. Clinconsensus: A consensus-based benchmark for evaluating chinese medical llms across difficulty levels. arXiv preprint arXiv:2603.02097.

[13] Table 3: BenchmarkCard (left, filled once by benchmark designers) and practitioner deployment assessment (right, filled per deployment context) for Hager et al. [2024], where the benchmark is licensing exams and deployment is clinicians from MIMIC IV [Johnson et al., 2023]. Question Assumption Answer Holds at deployment? What is the intended use case? ...
