A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks

Ali Merali; Fan X. Chen; Samiha A. Ismail

arxiv: 2607.02175 · v1 · pith:JBNKV4GKnew · submitted 2026-07-02 · 💻 cs.AI · cs.LG


Pith reviewed 2026-07-03 14:03 UTC · model grok-4.3


keywords: clinical reasoning, rubric evaluation, frontier language models, medical AI benchmarks, weighted criteria, performance inversion, expert-authored tasks


The pith

Frontier LLMs pass low-stakes clinical criteria at 80-90 percent but critical ones at only 32-42 percent, and 52 percent of the weight-5 criteria are satisfied by no model at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a controlled evaluation of three frontier models on five clinician-authored clinical scenarios using atomic weighted rubrics that break each task into 25-62 MECE criteria. It shows mean pass rates of 0.37-0.47 overall, yet reveals a sharp inversion: the most important weight-5 criteria succeed only 32.4-41.7 percent of the time while weight-1 criteria reach 80-90 percent, and 52 percent of the 108 critical criteria are missed by every model. This matters because it demonstrates that current models systematically underperform on the elements clinicians judge most consequential for patient outcomes. The work also reports that separate LLM autoraters match expert met/not-met labels on 92.8-94.7 percent of graded items, supporting the rubric method as scalable. The contribution is positioned as a methods pipeline plus preliminary findings rather than a large benchmark.

Core claim

On five expert-drafted scenarios spanning anaesthesia, internal/family medicine, emergency medicine, and obstetrics, GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro achieve mean rubric pass rates of 0.39, 0.47, and 0.37 respectively. The 108 weight-5 criteria are satisfied only 32.4-41.7 percent of the time while the weight-1 criteria reach 80-90 percent, and 56 of those 108 critical criteria are satisfied by no model at all.

What carries the argument

Atomic, weighted, MECE rubrics (184 criteria total) derived from clinician-drafted golden answers for each of the five scenarios, used to score model free-text responses on a met/not-met basis.
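To make the scoring mechanics concrete, here is a minimal sketch of met/not-met scoring stratified by weight. The `Criterion` type, the function names, and the toy rubric are all hypothetical and invented for illustration; the summary does not say whether the paper's mean pass rate is weighted or unweighted, so the unweighted version below is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str    # atomic statement checked against the model response
    weight: int  # clinical importance, 1 (low stakes) to 5 (critical)
    met: bool    # grader's met/not-met label for one model response


def pass_rate(criteria):
    """Unweighted fraction of criteria marked met (an assumed definition)."""
    return sum(c.met for c in criteria) / len(criteria)


def pass_rate_by_weight(criteria):
    """Fraction of criteria met within each weight class."""
    buckets = defaultdict(list)
    for c in criteria:
        buckets[c.weight].append(c.met)
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}


# Toy rubric, invented for illustration only (not taken from the paper).
toy = [
    Criterion("States the working diagnosis", 1, True),
    Criterion("Mentions routine monitoring", 1, True),
    Criterion("Orders the time-critical intervention", 5, False),
    Criterion("Recognises the contraindication", 5, True),
    Criterion("Escalates to senior support", 5, False),
]

overall = pass_rate(toy)
by_weight = pass_rate_by_weight(toy)
print(overall, by_weight)
```

In this toy case the overall rate looks moderate while the weight-5 stratum lags the weight-1 stratum, which is the shape of the inversion the paper reports.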

If this is right

Model performance drops sharply on the criteria clinicians assign highest clinical weight.
More than half of all critical criteria remain unsatisfied by any of the three models tested.
Low-stakes criteria are solved at high rates while high-stakes ones are not.
LLM-based autoraters can reproduce expert rubric labels at 92.8-94.7 percent agreement.
The five-task pipeline is presented as ready to expand into a larger benchmark.
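The autorater agreement figure can be read as simple percent agreement on met/not-met labels. The sketch below computes that, plus Cohen's kappa as a chance-corrected companion; kappa is not reported in the summary and is added here only as an assumption about what a fuller analysis might include. The labels are toy data.

```python
def percent_agreement(expert, auto):
    """Share of items where the autorater label equals the expert label."""
    if len(expert) != len(auto) or not expert:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(e == a for e, a in zip(expert, auto)) / len(expert)


def cohens_kappa(expert, auto):
    """Chance-corrected agreement for two raters with binary labels."""
    n = len(expert)
    p_observed = percent_agreement(expert, auto)
    expert_met = sum(expert) / n
    auto_met = sum(auto) / n
    # Agreement expected if the raters labelled independently
    # at their own base rates.
    p_expected = expert_met * auto_met + (1 - expert_met) * (1 - auto_met)
    if p_expected == 1.0:
        # Both raters used the same single label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels, invented for illustration (True = criterion met).
expert_labels = [True, True, False, False, False, True, False, False]
auto_labels = [True, False, False, False, False, True, False, False]

agreement = percent_agreement(expert_labels, auto_labels)
kappa = cohens_kappa(expert_labels, auto_labels)
print(agreement, kappa)
```

Raw agreement can look high when most items share one label, which is why a chance-corrected statistic is a common complement.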

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations that ignore clinical weighting may overestimate readiness for real deployment.
Training objectives focused on critical reasoning steps could close the observed gap.
The same rubric method could be applied to other high-stakes domains beyond medicine.

Load-bearing premise

The clinician-authored rubrics accurately capture what counts as correct clinical reasoning in the five chosen scenarios.

What would settle it

A new model run on the same five scenarios that satisfies more than half of the weight-5 criteria while the existing models satisfy fewer than half.

Original abstract

Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved - its "Hard" subset top score remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate three frontier models: GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro. Mean rubric pass rates were 0.47 (Claude), 0.39 (GPT), and 0.37 (Gemini). The central finding is an inversion of clinical priority: the highest-weighted (weight-5, critical) criteria passed at only 32.4-41.7%, while low-stakes weight-1 criteria passed at 80-90%. 56 of 108 critical (weight-5) criteria (52%) were satisfied by no model. Three LLM autoraters reproduced expert met/not-met labels on 92.8-94.7% of 552 graded criteria. We position this as a methods-and-preliminary-findings contribution: the five tasks demonstrate a scalable, defensible pipeline ready to develop into a large-scale benchmark.
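The counts in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the abstract; reading the 552 graded criteria as 184 criteria times three models is an inference, not something the abstract states explicitly.

```python
# Figures quoted in the abstract.
N_CRITERIA = 184               # rubric criteria across the five tasks
N_MODELS = 3                   # GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro
N_GRADED = 552                 # graded criteria used for autorater agreement
N_CRITICAL = 108               # weight-5 criteria
N_CRITICAL_MISSED_BY_ALL = 56  # weight-5 criteria satisfied by no model

# Assumed relationship: every criterion is graded once per model.
graded_expected = N_CRITERIA * N_MODELS

# Share of critical criteria that no model satisfied.
missed_share = N_CRITICAL_MISSED_BY_ALL / N_CRITICAL

print(graded_expected == N_GRADED, round(100 * missed_share, 1))
```

Under that reading the totals line up, and 56 of 108 rounds to the 52% quoted in the abstract.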

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows frontier models missing over half of weight-5 clinical criteria across five scenarios while nailing low-stakes ones, but the tiny sample keeps the inversion claim preliminary.


The main takeaway is the reported inversion: weight-5 criteria passed at 32-42% while weight-1 items hit 80-90%, and 52% of the critical criteria were missed by all three models. That gap is the concrete signal worth paying attention to.

The work adds a new five-scenario dataset with atomic, clinician-authored, weighted MECE rubrics (184 criteria total) built from golden answers. It evaluates GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro, reports mean pass rates of 0.39, 0.47, and 0.37 respectively, and shows LLM autoraters matching expert labels on 92.8-94.7% of graded criteria. The rubric format and the priority-inversion observation extend HealthBench in a straightforward way.

The pipeline for turning clinician drafts into graded criteria is described clearly enough to replicate, and the numbers are presented without overclaim. That part is solid for a methods-and-preliminary piece.

The soft spot is scale. Five tasks across four specialties give a suggestive pattern, but without reported variance, sensitivity checks on rubric completeness, or statistical tests the inversion could still be tied to these particular scenarios rather than a general model property. The abstract leaves rubric validation and prompt details thin, so the central claim rests on the assumption that these rubrics are representative.

This is for groups building clinical evaluation benchmarks who want a worked example of weighted atomic rubrics. It deserves peer review because the empirical setup is transparent and the rubric method is usable by others, even if the authors correctly call the findings preliminary and more tasks would strengthen it.

Referee Report

3 major / 2 minor

Summary. The paper introduces a small evaluation set of five clinician-authored clinical scenarios (spanning anaesthesia, internal/family medicine, emergency medicine, and obstetrics) each paired with an atomic, weighted, MECE rubric (25–62 criteria per task; 184 total) derived from a golden answer. Three frontier models (GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro) are evaluated, yielding mean rubric pass rates of 0.47 (Claude), 0.39 (GPT), and 0.37 (Gemini). The central empirical claim is a clinical-priority inversion: weight-5 (critical) criteria are passed at 32.4–41.7 % while weight-1 criteria reach 80–90 %, with 56 of 108 weight-5 criteria (52 %) satisfied by none of the models. LLM autoraters achieve 92.8–94.7 % agreement with expert labels on 552 graded items. The work is positioned as a methods-and-preliminary-findings contribution demonstrating a scalable rubric pipeline.

Significance. If the reported inversion proves robust, the result would highlight a systematic misalignment between frontier-model behavior and clinical priorities, with direct implications for safety-critical deployment. The explicit weighting, MECE construction, and clinician authorship of the rubrics constitute a methodological advance over saturated multiple-choice benchmarks and the HealthBench “Hard” subset. The high autorater agreement further supports the feasibility of scaling such evaluations. Because the study is explicitly framed as preliminary and limited to five tasks, its primary contribution lies in the reusable pipeline rather than in a general claim about model capabilities.

major comments (3)

[Abstract / Methods] Abstract and Methods: the manuscript reports specific pass rates and autorater agreement but supplies no information on the rubric-validation process (e.g., inter-clinician agreement on weights or MECE completeness), model prompt templates, temperature settings, or statistical tests for the weight-class differences. These omissions are load-bearing for the inversion claim, as they prevent assessment of whether the 32–42 % vs. 80–90 % gap is reproducible or artifactual.
[Results] Results (central finding paragraph): the priority inversion is computed over only five tasks with no reported cross-task variance, confidence intervals, or sensitivity analysis to rubric completeness. The claim that 52 % of weight-5 criteria are missed by all models therefore rests on an unquantified assumption that the chosen scenarios are representative; this directly affects the generalizability asserted in the positioning statement.
[Discussion] Discussion / positioning statement: the paper asserts that the five tasks “demonstrate a scalable, defensible pipeline,” yet provides no analysis of task-selection criteria or inter-rater reliability on the weight assignments. Without such evidence the inversion cannot be confidently attributed to model properties rather than scenario idiosyncrasies.
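As an illustration of the kind of uncertainty estimate the major comments ask for, the sketch below computes a Wilson score interval for the 56-of-108 figure. This is not an analysis from the paper. It treats the 108 criteria as independent trials, which is an assumption that likely understates the uncertainty because the criteria are clustered within five tasks.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 56 of the 108 weight-5 criteria were satisfied by no model (from the abstract).
lo, hi = wilson_interval(56, 108)
print(f"{lo:.3f} to {hi:.3f}")
```

Resampling whole tasks rather than individual criteria would respect the clustering, but with only five tasks any such interval would be very wide.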

minor comments (2)

[Abstract] The abstract cites HealthBench but does not supply a reference or comparison table; adding this would clarify the claimed advance.
[Abstract] Notation for model versions (GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro) should be standardized or footnoted with exact release identifiers.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments on transparency and scope. We agree that additional details and explicit limitations are warranted given the preliminary framing. We will revise the manuscript accordingly and address each point below.


Referee: [Abstract / Methods] Abstract and Methods: the manuscript reports specific pass rates and autorater agreement but supplies no information on the rubric-validation process (e.g., inter-clinician agreement on weights or MECE completeness), model prompt templates, temperature settings, or statistical tests for the weight-class differences. These omissions are load-bearing for the inversion claim, as they prevent assessment of whether the 32–42 % vs. 80–90 % gap is reproducible or artifactual.

Authors: We agree these details improve reproducibility. In revision we will add the exact prompt templates and temperature (set to 0.0 for determinism) to Methods. Rubrics were constructed by the single authoring clinician per task from the golden answer to ensure atomic MECE structure and clinical-priority weights; no inter-clinician agreement was measured owing to the small preliminary scope. We will add this limitation explicitly. No formal statistical tests were performed given n=5 tasks; we will include per-task pass-rate breakdowns. revision: partial

Referee: [Results] Results (central finding paragraph): the priority inversion is computed over only five tasks with no reported cross-task variance, confidence intervals, or sensitivity analysis to rubric completeness. The claim that 52 % of weight-5 criteria are missed by all models therefore rests on an unquantified assumption that the chosen scenarios are representative; this directly affects the generalizability asserted in the positioning statement.

Authors: The work is framed as preliminary findings on five tasks, not a general claim. The 52 % figure is an observation within these rubrics. We will add per-task breakdowns to show variation. Confidence intervals and sensitivity analyses to rubric completeness are not feasible without additional clinician variants; we will strengthen the limitations discussion and temper generalizability language in the positioning statement. revision: partial

Referee: [Discussion] Discussion / positioning statement: the paper asserts that the five tasks “demonstrate a scalable, defensible pipeline,” yet provides no analysis of task-selection criteria or inter-rater reliability on the weight assignments. Without such evidence the inversion cannot be confidently attributed to model properties rather than scenario idiosyncrasies.

Authors: We will revise the positioning statement to describe task selection (clinician judgment prioritizing high-stakes domains across four specialties) and to state that the pipeline is shown on these five tasks as a proof of concept. Inter-rater reliability on weights was not assessed; we will add this as an explicit limitation and note that larger-scale follow-up would incorporate such validation. revision: yes

standing simulated objections not resolved

Inter-clinician agreement metrics on rubric weights and MECE completeness, which were not collected in the original preliminary study.

Circularity Check

0 steps flagged

No circularity: purely empirical rubric-based comparison


The paper reports direct empirical measurements of model performance on 184 clinician-authored rubric criteria across five scenarios. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or positioning. The central inversion finding is computed from observed pass rates on weight-5 vs. weight-1 criteria; the methods contribution explicitly frames the work as preliminary and scalable rather than claiming a general result by construction. No load-bearing step reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper; central claims rest on the unverified assumption that the five scenarios and their rubrics constitute a valid proxy for clinical reasoning quality. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5800 in / 1172 out tokens · 43028 ms · 2026-07-03T14:03:24.943037+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1] Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating large language models towards improved human health. arXiv. https://arxiv.org/abs/2505.08775 Bedi, S., Jain, S. S., Chandra, R., Pierson, E., Koyejo, S., Stoy...

[2] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Liu, T., Xu, Z., Hu, Z., Shi, W., Zhuang, Y., & Yu, H. (2025). OpenRubrics: Towards scalable synthetic rubric generation for reward modeling a...
