A rubric-based controlled comparison of frontier language models on expert-authored clinical reasoning tasks
Pith reviewed 2026-07-03 14:03 UTC · model grok-4.3
The pith
Frontier LLMs pass low-stakes clinical criteria at 80-90 percent but critical ones at only 32-42 percent, missing more than half of all weight-5 items.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On five expert-drafted scenarios spanning anaesthesia, internal medicine, emergency medicine, and obstetrics, GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro achieve mean rubric pass rates of 0.39, 0.47, and 0.37 respectively. The 108 weight-5 criteria are satisfied only 32.4-41.7 percent of the time while the weight-1 criteria reach 80-90 percent, and 56 of those 108 critical criteria are satisfied by no model at all.
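The percentages in this claim can be cross-checked against the counts that are quoted with them. The sketch below is a back-of-envelope check, not part of the paper: the helper `counts_matching` is a hypothetical name, and the implied per-model counts it recovers are an inference from rounding, not figures the source reports.

```python
# Counts taken from the text above.
N_CRITICAL = 108   # weight-5 criteria
N_CRITERIA = 184   # all rubric criteria
N_MODELS = 3       # GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro

# Share of critical criteria satisfied by no model.
unsolved_share = 56 / N_CRITICAL

# Each model is graded on every criterion.
graded_items = N_CRITERIA * N_MODELS


def counts_matching(pct, total):
    """Return every integer count k whose share k/total rounds to pct (one decimal)."""
    return [k for k in range(total + 1) if round(100 * k / total, 1) == pct]


print(f"{unsolved_share:.1%}")             # share of weight-5 criteria met by no model
print(graded_items)                        # total graded (model, criterion) pairs
print(counts_matching(32.4, N_CRITICAL))   # counts consistent with the low end
print(counts_matching(41.7, N_CRITICAL))   # counts consistent with the high end
```

Under these assumptions, 184 criteria across three models gives the 552 graded items mentioned in the abstract, and the 32.4 and 41.7 percent endpoints each correspond to a single whole-number count out of 108.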
What carries the argument
Atomic, weighted, MECE rubrics (184 criteria total) derived from clinician-drafted golden answers for each of the five scenarios, used to score model free-text responses on a met/not-met basis.
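A minimal sketch of how met/not-met labels on a weighted rubric could be turned into the two headline statistics: pass rate per weight class and the critical criteria that no model satisfies. Everything here is illustrative. The `Criterion` class, the function names, the toy criteria, and the model labels are hypothetical, and the 1-to-5 integer weight scale is an assumption (the source names only weight-1 and weight-5).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    cid: str      # hypothetical criterion identifier
    weight: int   # assumed scale: 1 (low stakes) to 5 (critical)


def pass_rate_by_weight(criteria, met):
    """Fraction of criteria met within each weight class, for one model.

    `met` maps criterion id -> bool (the met/not-met label).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for c in criteria:
        totals[c.weight] += 1
        hits[c.weight] += int(met[c.cid])
    return {w: hits[w] / totals[w] for w in sorted(totals)}


def unmet_by_all(criteria, met_by_model, weight=5):
    """Ids of criteria at the given weight that no model satisfied."""
    return [
        c.cid
        for c in criteria
        if c.weight == weight
        and not any(labels[c.cid] for labels in met_by_model.values())
    ]


# Toy data, invented for illustration only.
criteria = [Criterion("c1", 5), Criterion("c2", 5),
            Criterion("c3", 1), Criterion("c4", 1)]
met_by_model = {
    "model_a": {"c1": True, "c2": False, "c3": True, "c4": True},
    "model_b": {"c1": False, "c2": False, "c3": True, "c4": False},
}

for name, labels in met_by_model.items():
    print(name, pass_rate_by_weight(criteria, labels))
print("unmet by all:", unmet_by_all(criteria, met_by_model))
```

Keeping the weight on the criterion rather than in the score lets the same labels be sliced by weight class afterwards, which is the comparison the inversion finding relies on.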
If this is right
- Model performance drops sharply on the criteria clinicians assign highest clinical weight.
- More than half of all critical criteria remain unsatisfied by any of the three models tested.
- Low-stakes criteria are solved at high rates while high-stakes ones are not.
- LLM-based autoraters can reproduce expert rubric labels at 92.8-94.7 percent agreement.
- The five-task pipeline is presented as ready to expand into a larger benchmark.
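The autorater agreement figure in the list above is a simple match rate between two sets of binary labels. The sketch below shows one way to compute it, assuming labels are keyed by a (model, criterion id) pair; the function name, the keys, and the toy labels are hypothetical and do not come from the paper.

```python
def agreement_rate(expert, autorater):
    """Fraction of graded items on which the autorater matches the expert label.

    Both arguments map a (model, criterion_id) key to a met/not-met bool.
    """
    if expert.keys() != autorater.keys():
        raise ValueError("expert and autorater must label the same items")
    if not expert:
        raise ValueError("no graded items")
    matches = sum(expert[key] == autorater[key] for key in expert)
    return matches / len(expert)


# Toy labels, invented for illustration only.
expert = {
    ("model_a", "c1"): True,
    ("model_a", "c2"): False,
    ("model_b", "c1"): False,
    ("model_b", "c2"): False,
}
autorater = {
    ("model_a", "c1"): True,
    ("model_a", "c2"): True,    # the one disagreement
    ("model_b", "c1"): False,
    ("model_b", "c2"): False,
}

print(agreement_rate(expert, autorater))
```

Raw agreement does not correct for chance or for class imbalance, so a chance-corrected statistic could be reported alongside it; the source gives only the raw percentages.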
Where Pith is reading between the lines
- Evaluations that ignore clinical weighting may overestimate readiness for real deployment.
- Training objectives focused on critical reasoning steps could close the observed gap.
- The same rubric method could be applied to other high-stakes domains beyond medicine.
Load-bearing premise
The clinician-authored rubrics accurately capture what counts as correct clinical reasoning in the five chosen scenarios.
What would settle it
A new model run on the same five scenarios that satisfies more than half of the weight-5 criteria while the existing models satisfy fewer than half.
Original abstract
Multiple-choice medical benchmarks are increasingly saturated, and recent rubric-based evaluations such as HealthBench have shown that open-ended clinical performance is far from solved - its "Hard" subset top score remains 32%. We present a small, deliberately difficult evaluation dataset of five clinician-authored clinical scenarios spanning four specialties (anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each accompanied by an atomic, weighted, MECE rubric (25-62 criteria per task; 184 criteria total) authored from a clinician-drafted golden answer. We evaluate three frontier models: GPT 5.4, Claude Opus 4.7, and Gemini 3.1 Pro. Mean rubric pass rates were 0.47 (Claude), 0.39 (GPT), and 0.37 (Gemini). The central finding is an inversion of clinical priority: the highest-weighted (weight-5, critical) criteria passed at only 32.4-41.7%, while low-stakes weight-1 criteria passed at 80-90%. 56 of 108 critical (weight-5) criteria (52%) were satisfied by no model. Three LLM autoraters reproduced expert met/not-met labels on 92.8-94.7% of 552 graded criteria. We position this as a methods-and-preliminary-findings contribution: the five tasks demonstrate a scalable, defensible pipeline ready to develop into a large-scale benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a small evaluation set of five clinician-authored clinical scenarios (spanning anaesthesia, internal/family medicine, emergency medicine, and obstetrics), each paired with an atomic, weighted, MECE rubric (25–62 criteria per task; 184 total) derived from a golden answer. Three frontier models (GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro) are evaluated, yielding mean rubric pass rates of 0.39, 0.47, and 0.37 respectively. The central empirical claim is a clinical-priority inversion: weight-5 (critical) criteria are passed at 32.4–41.7 % while weight-1 criteria reach 80–90 %, with 56 of 108 weight-5 criteria (52 %) satisfied by none of the models. LLM autoraters achieve 92.8–94.7 % agreement with expert labels on 552 graded items. The work is positioned as a methods-and-preliminary-findings contribution demonstrating a scalable rubric pipeline.
Significance. If the reported inversion proves robust, the result would highlight a systematic misalignment between frontier-model behavior and clinical priorities, with direct implications for safety-critical deployment. The explicit weighting, MECE construction, and clinician authorship of the rubrics constitute a methodological advance over saturated multiple-choice benchmarks and the HealthBench “Hard” subset. The high autorater agreement further supports the feasibility of scaling such evaluations. Because the study is explicitly framed as preliminary and limited to five tasks, its primary contribution lies in the reusable pipeline rather than in a general claim about model capabilities.
major comments (3)
- [Abstract / Methods] Abstract and Methods: the manuscript reports specific pass rates and autorater agreement but supplies no information on the rubric-validation process (e.g., inter-clinician agreement on weights or MECE completeness), model prompt templates, temperature settings, or statistical tests for the weight-class differences. These omissions are load-bearing for the inversion claim, as they prevent assessment of whether the 32–42 % vs. 80–90 % gap is reproducible or artifactual.
- [Results] Results (central finding paragraph): the priority inversion is computed over only five tasks with no reported cross-task variance, confidence intervals, or sensitivity analysis to rubric completeness. The claim that 52 % of weight-5 criteria are missed by all models therefore rests on an unquantified assumption that the chosen scenarios are representative; this directly affects the generalizability asserted in the positioning statement.
- [Discussion] Discussion / positioning statement: the paper asserts that the five tasks “demonstrate a scalable, defensible pipeline,” yet provides no analysis of task-selection criteria or inter-rater reliability on the weight assignments. Without such evidence the inversion cannot be confidently attributed to model properties rather than scenario idiosyncrasies.
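To make the request for confidence intervals concrete, the sketch below computes a Wilson score interval for the one proportion whose counts are given in the abstract, 56 of 108 weight-5 criteria satisfied by no model. This is an illustration only and is not an analysis the paper reports. It assumes the 108 criteria are independent draws, which is unlikely to hold because criteria are nested within five tasks, so the true uncertainty is probably wider. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts from the abstract: 56 of 108 critical criteria met by no model.
low, high = wilson_interval(56, 108)
print(f"{low:.3f} to {high:.3f}")
```

Even under the optimistic independence assumption the interval spans roughly 0.43 to 0.61, so it includes values below one half. A task-level resampling scheme would respect the nesting, but with only five tasks it would have very little to resample.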
minor comments (2)
- [Abstract] The abstract cites HealthBench but does not supply a reference or comparison table; adding this would clarify the claimed advance.
- [Abstract] Notation for model versions (GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro) should be standardized or footnoted with exact release identifiers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on transparency and scope. We agree that additional details and explicit limitations are warranted given the preliminary framing. We will revise the manuscript accordingly and address each point below.
Point-by-point responses
- Referee: [Abstract / Methods] Abstract and Methods: the manuscript reports specific pass rates and autorater agreement but supplies no information on the rubric-validation process (e.g., inter-clinician agreement on weights or MECE completeness), model prompt templates, temperature settings, or statistical tests for the weight-class differences. These omissions are load-bearing for the inversion claim, as they prevent assessment of whether the 32–42 % vs. 80–90 % gap is reproducible or artifactual.
  Authors: We agree these details improve reproducibility. In revision we will add the exact prompt templates and temperature (set to 0.0 for determinism) to Methods. Rubrics were constructed by the single authoring clinician per task from the golden answer to ensure atomic MECE structure and clinical-priority weights; no inter-clinician agreement was measured owing to the small preliminary scope. We will add this limitation explicitly. No formal statistical tests were performed given n=5 tasks; we will include per-task pass-rate breakdowns.
  Revision: partial
- Referee: [Results] Results (central finding paragraph): the priority inversion is computed over only five tasks with no reported cross-task variance, confidence intervals, or sensitivity analysis to rubric completeness. The claim that 52 % of weight-5 criteria are missed by all models therefore rests on an unquantified assumption that the chosen scenarios are representative; this directly affects the generalizability asserted in the positioning statement.
  Authors: The work is framed as preliminary findings on five tasks, not a general claim. The 52 % figure is an observation within these rubrics. We will add per-task breakdowns to show variation. Confidence intervals and sensitivity analyses to rubric completeness are not feasible without additional clinician variants; we will strengthen the limitations discussion and temper generalizability language in the positioning statement.
  Revision: partial
- Referee: [Discussion] Discussion / positioning statement: the paper asserts that the five tasks “demonstrate a scalable, defensible pipeline,” yet provides no analysis of task-selection criteria or inter-rater reliability on the weight assignments. Without such evidence the inversion cannot be confidently attributed to model properties rather than scenario idiosyncrasies.
  Authors: We will revise the positioning statement to describe task selection (clinician judgment prioritizing high-stakes domains across four specialties) and to state that the pipeline is shown on these five tasks as a proof of concept. Inter-rater reliability on weights was not assessed; we will add this as an explicit limitation and note that larger-scale follow-up would incorporate such validation.
  Revision: yes
- Inter-clinician agreement metrics on rubric weights and MECE completeness, which were not collected in the original preliminary study.
Circularity Check
No circularity: purely empirical rubric-based comparison
Full rationale
The paper reports direct empirical measurements of model performance on 184 clinician-authored rubric criteria across five scenarios. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or positioning. The central inversion finding is computed from observed pass rates on weight-5 vs. weight-1 criteria; the methods contribution explicitly frames the work as preliminary and scalable rather than claiming a general result by construction. No load-bearing step reduces to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating large language models towards improved human health. arXiv. https://arxiv.org/abs/2505.08775
  Bedi, S., Jain, S. S., Chandra, R., Pierson, E., Koyejo, S., Stoy...
- [2] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., & Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
  Liu, T., Xu, Z., Hu, Z., Shi, W., Zhuang, Y., & Yu, H. (2025). OpenRubrics: Towards scalable synthetic rubric generation for reward modeling a...