Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness
Pith reviewed 2026-05-14 23:58 UTC · model grok-4.3
The pith
LLM judges and clinicians apply different standards for what counts as a complete medical chatbot response
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM judges discriminate complete from incomplete medical responses at or only slightly above chance level (AUC 0.49–0.66). At the operating point that catches 90 percent of incomplete answers, clinicians must still review the great majority of outputs, so the judge yields no triage gain. Even when verdicts match, the cited explanations rarely coincide; when verdicts differ, false positives arise from over-sensitivity to non-essential details and false negatives from outright failure to notice omissions. The authors conclude that LLM judges and clinicians operate under fundamentally different completeness criteria.
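The two quantities behind this claim, discrimination (AUC) and the review workload at a 90 percent recall operating point, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the convention that label 1 means "incomplete", and the toy scores are all assumptions.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random incomplete response (label 1) gets a higher
    judge score than a random complete one (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one response of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def review_fraction_at_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float = 0.9
) -> float:
    """Smallest share of all responses that must be flagged (score >= threshold)
    so that at least `target_recall` of the incomplete ones are caught."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no incomplete responses in the data")
    # Lower the threshold step by step; the first one that reaches the
    # target recall flags the fewest responses.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        if sum(flagged) / n_pos >= target_recall:
            return len(flagged) / len(scores)
    return 1.0


# Toy data (hypothetical): higher score = judge thinks "more likely incomplete".
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 1, 0]
print(f"AUC: {auc(toy_scores, toy_labels):.2f}")
print(f"Share to review at 90% recall: {review_fraction_at_recall(toy_scores, toy_labels):.2f}")
```

A judge is useful for triage only if the second number is well below 1; when it stays close to 1, clinicians review nearly everything anyway.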
What carries the argument
Direct comparison of LLM verdict-plus-explanation against clinician ground-truth labels across three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) on two medical response datasets.
If this is right
- LLM judges cannot be deployed as standalone evaluators of medical chatbot completeness.
- They offer no reduction in clinician review workload when used as triage filters.
- Verdict agreement between model and clinician does not imply shared reasoning about quality.
- Over-flagging and missed gaps both remain common failure modes under current prompting.
Where Pith is reading between the lines
- Explanation agreement may be a stronger evaluation target than verdict agreement alone for future medical LLM judges.
- Hybrid human-plus-model pipelines will likely be required until models can be trained to internalize clinician completeness criteria.
- Benchmarks that report only accuracy or AUC on completeness labels will understate the practical risk of mismatched standards.
Load-bearing premise
The clinician annotations used as ground truth accurately and completely capture the clinically relevant aspects of response completeness.
What would settle it
A replication in which LLM judges and clinicians cite the same specific missing elements in the majority of cases where both label a response incomplete.
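The criterion above reduces to a single rate: among responses that both the judge and the clinician label incomplete, how often do they cite the same missing elements? A minimal sketch follows, assuming each case is a record with the hypothetical fields `clinician_label`, `model_label`, and `omissions_aligned`; these names and values are illustrative and not taken from the paper's data format.

```python
from typing import Iterable, Mapping


def shared_omission_rate(rows: Iterable[Mapping[str, str]]) -> float:
    """Among responses both parties label incomplete, the share where the
    model's cited omissions match the clinician's."""
    both_incomplete = [
        r for r in rows
        if r["clinician_label"] == "incomplete" and r["model_label"] == "incomplete"
    ]
    if not both_incomplete:
        raise ValueError("no responses labeled incomplete by both parties")
    aligned = sum(r["omissions_aligned"] == "Yes" for r in both_incomplete)
    return aligned / len(both_incomplete)


# Hypothetical records for illustration only.
example_rows = [
    {"clinician_label": "incomplete", "model_label": "incomplete", "omissions_aligned": "Yes"},
    {"clinician_label": "incomplete", "model_label": "incomplete", "omissions_aligned": "No"},
    {"clinician_label": "incomplete", "model_label": "incomplete", "omissions_aligned": "Yes"},
    {"clinician_label": "complete", "model_label": "incomplete", "omissions_aligned": "No"},
]
rate = shared_omission_rate(example_rows)
print(f"shared-omission rate: {rate:.2f}; majority aligned: {rate > 0.5}")
```

Under this reading, a replication would count against the paper's conclusion if the rate exceeded 0.5 on real data.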
Original abstract
LLM-as-a-Judge frameworks are increasingly trusted to automate evaluation in place of human experts, yet their reliability in high-stakes medical contexts remains unproven. We stress-test this assumption for detecting incomplete patient-facing medical responses, evaluating three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinician-annotated datasets, including HealthBench, the largest publicly available benchmark for medical response evaluation. LLM Judges discriminate complete from incomplete responses at or only slightly above chance (AUC 0.49–0.66); at the threshold required to recall 90% of incomplete responses, clinicians must still review the vast majority of the dataset, offering no triage utility. Even when model and clinician verdicts agree, they rarely cite the same explanation; and when they diverge, false positives stem from over-flagging non-essential gaps while false negatives reflect outright detection failures. These results reveal that LLM Judges and clinicians apply fundamentally different completeness standards, a finding that undermines their use as autonomous evaluators or triage filters in clinical settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates LLM-as-a-Judge systems for detecting incomplete patient-facing medical chatbot responses across three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models on two clinician-annotated datasets, including HealthBench. It reports AUCs of 0.49–0.66 (near chance), no triage utility at 90% recall thresholds, and frequent mismatches in explanations even on agreed verdicts, concluding that LLMs and clinicians apply fundamentally different completeness standards and thus cannot serve as autonomous evaluators in clinical settings.
Significance. If the results hold after addressing verification gaps, the work provides useful empirical caution against over-reliance on LLM judges for high-stakes medical evaluation tasks. The inclusion of the large public HealthBench benchmark adds value for reproducibility and generalizability, highlighting risks in automated triage or completeness filtering that could inform safer hybrid evaluation practices.
major comments (2)
- [Methods] Methods: No inter-rater reliability statistics (Cohen’s kappa, Fleiss’ kappa, or raw agreement rates) are reported for clinician annotations on either dataset. Completeness judgments are known to be subjective; without these metrics the observed AUC range (0.49–0.66) and explanation mismatches could partly reflect label noise rather than genuine model–clinician divergence, directly undermining the central claim of fundamentally different standards.
- [Abstract] Abstract and Results: Exact dataset sizes, number of annotations per response, label distributions, and precise application details for the three rubric granularities are not provided. This prevents verification of the reported AUCs and triage implications, which are load-bearing for the no-utility conclusion.
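The inter-rater statistic requested in the first major comment can be illustrated with Cohen's kappa for two raters. This is a generic sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), and not something taken from the paper; the rater lists are hypothetical, and the statistic can only be computed when at least two clinicians have labeled the same responses.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = incomplete, 0 = complete.
clinician_1 = [1, 1, 0, 0, 1, 0]
clinician_2 = [1, 0, 0, 0, 1, 1]
print(f"Cohen's kappa: {cohens_kappa(clinician_1, clinician_2):.2f}")
```

A kappa near 0 would suggest that the labels themselves are noisy, which is the alternative explanation for low AUC that the referee raises.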
minor comments (2)
- [Results] Results: Specify the exact number of responses evaluated and the proportion labeled complete vs. incomplete for each dataset to allow readers to assess base rates.
- [Discussion] Discussion: Add a brief note on potential selection biases in the clinician annotators or the choice of backbone models.
Simulated Author's Rebuttal
We appreciate the referee's detailed review and the opportunity to clarify aspects of our work. Below we address the major comments point by point.
- Referee: [Methods] Methods: No inter-rater reliability statistics (Cohen’s kappa, Fleiss’ kappa, or raw agreement rates) are reported for clinician annotations on either dataset. Completeness judgments are known to be subjective; without these metrics the observed AUC range (0.49–0.66) and explanation mismatches could partly reflect label noise rather than genuine model–clinician divergence, directly undermining the central claim of fundamentally different standards.
  Authors: We thank the referee for highlighting this important methodological detail. The annotations on both datasets were carried out by a single clinician with expertise in medical communication to ensure high internal consistency, which is a standard practice for such specialized tasks where multiple raters may introduce additional variability. We agree that reporting inter-rater reliability would be beneficial and will revise the Methods section to explicitly state the annotation process and include any available agreement metrics from the source datasets. Additionally, we will discuss the potential impact of label subjectivity on our findings. However, we note that the observed low discrimination (AUC 0.49–0.66) and the detailed analysis of explanation mismatches (even in agreed verdicts) provide convergent evidence for differing standards that goes beyond potential noise in the labels.
  Revision: yes
- Referee: [Abstract] Abstract and Results: Exact dataset sizes, number of annotations per response, label distributions, and precise application details for the three rubric granularities are not provided. This prevents verification of the reported AUCs and triage implications, which are load-bearing for the no-utility conclusion.
  Authors: We acknowledge that the abstract and high-level results summary lack these specifics, which are crucial for reproducibility. The full paper provides the dataset sizes, label distributions, and details on how the three rubric granularities were applied in the Methods section. To address this, we will update the abstract to include key statistics such as dataset sizes and label balances, and add a summary table in the Results for the rubric applications. This will facilitate verification of the AUC computations and the triage utility analysis without altering the conclusions.
  Revision: yes
Circularity Check
No significant circularity; empirical comparison against external clinician labels
The paper is an empirical evaluation that computes AUC, triage utility, and explanation mismatch rates by directly comparing LLM judge outputs to clinician annotations on two external datasets (including the public HealthBench benchmark). No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. The central claim—that LLM judges and clinicians apply different standards—emerges from observed performance gaps rather than self-definition, self-citation chains, or renamed known results. Minor self-citations, if any, are not load-bearing for the reported metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Clinician annotations on the datasets represent the correct standard for medical response completeness.