arxiv: 2604.25933 · v1 · submitted 2026-04-03 · 💻 cs.CY · cs.AI · cs.CL

A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

Chenyu Li, Zohaib Akhtar, Mingu Kwak, Yuelyu Ji, Hang Zhang, Tracey Obi, Yufan Ren, Xizhi Wu, Sonish Sivarajkumar, Harold P. Lehmann, Shyam Visweswaran, Michael J. Becich, Danielle L. Mowery, Renxuan Liu, Haoyang Sun, Yanshan Wang

Pith reviewed 2026-05-13 19:09 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL

keywords LLM-as-a-Judge · healthcare · scoping review · bias assessment · governance · validation · MedJUDGE · clinical safety

The pith

LLM judges in healthcare often lack human oversight and bias testing, creating a risk that clinically significant errors go undetected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews 49 studies to map how LLMs are used as judges for other model outputs in healthcare. It shows that most work focuses on evaluation and benchmarking using GPT models with simple pointwise scoring, yet only a handful involve any expert review and almost none check for bias or fairness. This combination of limited oversight and model monoculture can let shared blind spots pass undetected, since agreement between judge and system may reflect common errors rather than accuracy. The authors introduce the MedJUDGE framework as a risk-stratified structure built around validity, safety, and accountability to guide safer deployment across different clinical risk levels. A reader would care because these gaps sit at the intersection of scalable AI evaluation and direct patient impact.

Core claim

The scoping review finds that LLM-as-a-Judge applications in healthcare are concentrated in evaluation and benchmarking tasks using GPT-family models and pointwise scoring, with median human validation of only three experts and no bias or demographic fairness testing in the large majority of studies. Only one study reached production deployment. These conditions create a governance gap in which agreement metrics may mask rather than reveal clinically significant errors, especially when judges and evaluated systems share training data or architectures. The MedJUDGE framework counters this by organizing evaluation into three pillars—validity, safety, and accountability—stratified by clinical-r

What carries the argument

The MedJUDGE framework, a risk-stratified three-pillar structure centered on validity, safety, and accountability that supplies deployment-oriented evaluation guidance for healthcare LLM-as-a-Judge systems.

If this is right

Agreement metrics between LLM judges and evaluated systems may fail to distinguish true validity from shared errors when models share training data or architectures.
Deployment decisions for LLM-as-a-Judge tools should be stratified by clinical risk tier rather than treated uniformly.
Future studies would need to include explicit bias, demographic fairness, temporal stability, and patient-context checks to meet the framework's safety pillar.
Only a small fraction of current systems reach production, indicating that structured guidance is required before wider scaling.
Accountability mechanisms such as audit trails and responsibility assignment become necessary components of any production deployment.
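
The first point above, that agreement can mask shared errors, can be made concrete with a toy sketch. Every label below is invented for illustration and is not data from the review; the `Case` type and the helper names are hypothetical. The sketch assumes a judge that accepts every system answer, including two that disagree with the expert reference.

```python
from dataclasses import dataclass

# Toy illustration only: all labels are invented, not data from the review.
@dataclass
class Case:
    reference: str   # label assigned by a human expert
    system: str      # answer produced by the evaluated model
    judge_ok: bool   # True if the LLM judge accepted the system answer

CASES = [
    Case("urgent", "urgent", True),
    Case("routine", "routine", True),
    Case("urgent", "routine", True),   # shared blind spot: judge accepts a wrong answer
    Case("routine", "routine", True),
    Case("urgent", "routine", True),   # shared blind spot again
]

def judge_acceptance_rate(cases):
    """Fraction of system answers the judge accepted."""
    return sum(c.judge_ok for c in cases) / len(cases)

def system_accuracy(cases):
    """Fraction of system answers that match the expert reference."""
    return sum(c.system == c.reference for c in cases) / len(cases)

def missed_errors(cases):
    """Cases the judge accepted although the system disagreed with the expert."""
    return [c for c in cases if c.judge_ok and c.system != c.reference]

print(judge_acceptance_rate(CASES), system_accuracy(CASES), len(missed_errors(CASES)))
```

In this toy setting the judge's acceptance rate alone says nothing about the two missed errors; only the comparison against the expert reference exposes them.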

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the framework is adopted, it could shape institutional review processes for AI tools that affect clinical decisions.
Extending MedJUDGE to include post-deployment monitoring would directly address the observed absence of temporal stability testing.
The same risk-stratified approach might transfer to LLM judgment tasks in other regulated domains such as legal document review.
A practical next step would be to test whether applying the three pillars reduces missed errors in a live clinical workflow.

Load-bearing premise

That the 49 included studies sufficiently represent the full landscape of LLM-as-a-Judge use in healthcare, and that the proposed MedJUDGE pillars will translate into measurable improvements in real deployments.

What would settle it

A controlled deployment trial in which an LLM judge without MedJUDGE-style validation misses a clinically significant error that a human reviewer catches, or, conversely, a trial that shows no error reduction after the three-pillar framework is applied.

Figures

Figures reproduced from arXiv: 2604.25933 by Chenyu Li, Danielle L. Mowery, Hang Zhang, Haoyang Sun, Harold P. Lehmann, Michael J. Becich, Mingu Kwak, Renxuan Liu, Shyam Visweswaran, Sonish Sivarajkumar, Tracey Obi, Xizhi Wu, Yanshan Wang, Yuelyu Ji, Yufan Ren, Zohaib Akhtar.

Original abstract

As large language models (LLMs) increasingly generate and process clinical text, scalable evaluation has become critical. LLM-as-a-Judge (LaaJ), which uses LLMs to evaluate model outputs, offers a scalable alternative to costly expert review, but its healthcare adoption raises safety and bias concerns. We conducted a PRISMA-ScR scoping review of six databases (January 2020-January 2026), screening 11,727 studies and including 49. The landscape was dominated by evaluation and benchmarking applications (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%). Despite growing adoption, validation rigor was limited: among 36 studies with human involvement, the median number of expert validators was 3, while 13 (26.5%) used none. Risk of bias testing was absent in 36 studies (73.5%), only 1 (2.0%) examined demographic fairness, and none assessed temporal stability or patient context. Deployment remained limited, with 1 study (2.0%) reaching production and four (8.2%) prototype stage. Importantly, these gaps may interact: when judges and evaluated systems share training data or architectures, they may inherit similar blind spots, and agreement metrics may fail to distinguish true validity from shared errors. Minimal human oversight, limited bias assessment, and model monoculture together represent a governance gap where current validation may miss clinically significant errors. To address this, we propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar framework organized around validity, safety, and accountability across clinical risk tiers, providing deployment-oriented evaluation guidance for healthcare LaaJ systems.
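
The percentages in the abstract can be checked against the stated counts. The sketch below assumes every percentage uses the 49 included studies as its denominator; the dictionary keys are shorthand labels chosen here, not terms from the paper.

```python
TOTAL_INCLUDED = 49  # studies included in the review, per the abstract

# (count, percentage as reported in the abstract); key names are shorthand
REPORTED = {
    "evaluation_and_benchmarking": (37, 75.5),
    "pointwise_scoring": (42, 85.7),
    "gpt_family_judges": (36, 73.5),
    "no_expert_validators": (13, 26.5),
    "no_bias_testing": (36, 73.5),
    "demographic_fairness": (1, 2.0),
    "production": (1, 2.0),
    "prototype": (4, 8.2),
}

def pct(count, total=TOTAL_INCLUDED):
    """Percentage of the included studies, rounded to one decimal place."""
    return round(100 * count / total, 1)

# Collect any entry whose recomputed percentage differs from the reported one.
mismatches = {
    name: (pct(count), reported)
    for name, (count, reported) in REPORTED.items()
    if pct(count) != reported
}
print(mismatches)
```

Under that assumption the recomputed values agree with the abstract, and the 36 studies with human involvement plus the 13 without account for all 49.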

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This scoping review quantifies thin validation practices for LLM-as-a-Judge in healthcare and proposes the MedJUDGE framework to address the gaps.

This scoping review maps out the current state of using LLMs to judge other models in healthcare and highlights how little validation is happening. The main points are that most applications skip bias testing and use very few human checks, which could let clinically important errors slip through.

The paper does a solid job pulling together 49 studies from a broad search across six databases. It reports clear percentages: 73.5% use GPT-family judges, 85.7% rely on pointwise scoring, and 73.5% do no risk of bias testing at all. Only one study made it to production. The authors follow the PRISMA-ScR guidelines for scoping reviews and include screening numbers, which makes the methods traceable. They also link the gaps to potential issues like model monoculture leading to shared blind spots, without overstating the evidence.

The MedJUDGE framework is new in this context, organizing evaluation around three pillars (validity, safety, and accountability) across different clinical risk tiers. It offers practical guidance for deployment rather than just academic critique.

On the softer side, the framework is conceptual and not tested here, so its effectiveness is an open question. The review's reliance on abstracts for some details means full verification of inclusion criteria would require the full texts, though the high-level patterns hold up. The governance gap claim is framed carefully as a possibility rather than a proven outcome.

This paper is aimed at AI researchers, clinicians, and regulators working on healthcare AI evaluation. Anyone looking for a current overview of LLM judge practices or ideas for better frameworks will find it useful. It deserves a serious referee because the synthesis is evidence-based and the proposal could help shape safer adoption.

Referee Report

1 major / 3 minor

Summary. This scoping review follows PRISMA-ScR to map LLM-as-a-Judge (LaaJ) applications in healthcare, screening 11,727 records from six databases (Jan 2020–Jan 2026) and including 49 studies. It reports dominance of evaluation/benchmarking uses (n=37, 75.5%), pointwise scoring (n=42, 85.7%), and GPT-family judges (n=36, 73.5%), alongside limited validation (median 3 expert validators among 36 studies with humans; 73.5% with no bias testing; 2.0% examining demographic fairness; zero assessing temporal stability or patient context) and minimal deployment (2.0% production, 8.2% prototype). The authors identify a governance gap arising from minimal oversight, absent bias assessment, and model monoculture that may allow clinically significant errors to go undetected, and propose the MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation) risk-stratified three-pillar framework addressing validity, safety, and accountability.

Significance. If the quantified landscape and gap analysis hold, the review supplies a timely, methodologically transparent baseline for an emerging evaluation paradigm in clinical AI. Credit is due for the PRISMA-ScR adherence and the explicit counts of practice distributions, which enable reproducible gap tracking. The MedJUDGE proposal offers deployment-oriented guidance that could reduce safety risks if operationalized, though its value depends on subsequent empirical testing rather than being demonstrated here.

major comments (1)

[Abstract and Results] The governance-gap claim that 'current validation may miss clinically significant errors' rests on the interaction of three quantified gaps (minimal human oversight, 73.5% absent bias testing, GPT monoculture). However, the manuscript provides no concrete examples from the 49 studies of undetected clinical errors traceable to these gaps, leaving the causal inference load-bearing for the central safety argument without direct evidentiary support.

minor comments (3)

[Abstract] The search window is stated as January 2020–January 2026; confirm whether this is a typographical error or an intended forward-looking cutoff, as it directly affects reproducibility of the screening numbers.
[Methods] Although PRISMA-ScR is cited, the manuscript should report inter-rater reliability (e.g., Cohen’s kappa) for the dual screening of 11,727 records to strengthen confidence in the 49 included studies and the derived percentages.
[Discussion] The three pillars of MedJUDGE are introduced at a high level; add at least one operational metric or checklist item per pillar (validity, safety, accountability) with reference to existing clinical evaluation standards to make the framework immediately usable.
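
For the inter-rater statistic named in the Methods comment, a minimal sketch of Cohen's kappa for two screeners follows. The screening decisions are hypothetical and the function name is chosen here; the paper's actual screening data are not available in this page. Kappa is computed as (observed agreement minus chance agreement) divided by (1 minus chance agreement).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Assumption: both raters used one identical label throughout;
        # kappa is undefined here, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical include/exclude decisions from two screeners on four records.
screener_1 = ["include", "include", "exclude", "exclude"]
screener_2 = ["include", "exclude", "exclude", "exclude"]
kappa = cohens_kappa(screener_1, screener_2)
print(kappa)
```

With these four hypothetical records the screeners agree on three, chance agreement is 0.5, and kappa comes out at 0.5.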

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation for minor revision. We address the single major comment below.

Referee: [Abstract and Results] The governance-gap claim that 'current validation may miss clinically significant errors' rests on the interaction of three quantified gaps (minimal human oversight, 73.5% absent bias testing, GPT monoculture). However, the manuscript provides no concrete examples from the 49 studies of undetected clinical errors traceable to these gaps, leaving the causal inference load-bearing for the central safety argument without direct evidentiary support.

Authors: We acknowledge that the manuscript does not provide concrete examples of undetected clinical errors drawn from the 49 included studies. As a scoping review, our synthesis is restricted to data reported in the published literature; none of the studies documented specific instances in which the identified gaps (low human oversight, absent bias testing, or model monoculture) resulted in missed clinical errors. The governance-gap claim is therefore inferential, grounded in the quantified distributions we report together with established concerns in the AI safety literature regarding shared blind spots and the limitations of agreement metrics. We agree that greater transparency is warranted. We will revise the abstract and the relevant discussion paragraph to state explicitly that the risk of missing clinically significant errors is a hypothesized consequence of the observed landscape rather than a claim supported by direct evidence of harm within the 49 studies. This clarification preserves the precautionary framing while accurately reflecting the evidentiary basis of the review.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework derived from external literature gaps

The paper is a PRISMA-ScR scoping review that screens 11,727 studies from six external databases and includes 49 independent papers. Landscape statistics (e.g., 73.5% GPT judges, 73.5% no bias testing) and the governance-gap claim are direct aggregates of those external findings. The MedJUDGE framework is explicitly positioned as risk-stratified guidance synthesized from the observed patterns rather than from any fitted parameter, self-definition, or self-citation chain. No equations exist; no load-bearing premise reduces to an author-overlapping citation or to a renaming of the input data. Minor self-citation, if present, is not required for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on standard scoping-review methodology and introduces the MedJUDGE framework as a synthesis of observed gaps without new empirical validation or external benchmarks.

axioms (1)

domain assumption: PRISMA-ScR guidelines provide an adequate structure for identifying gaps in the LLM-as-a-Judge literature
Invoked to justify the review protocol and screening process.

invented entities (1)

MedJUDGE framework (no independent evidence)
purpose: Risk-stratified three-pillar structure for validity, safety, and accountability in healthcare LLM judges
Newly proposed to address identified governance gaps; no independent validation or falsifiable predictions provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction; washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose MedJUDGE (Medical Judge Utility, De-biasing, Governance and Evaluation), a risk-stratified three-pillar conceptual framework organized around validity, safety, and accountability across three clinical risk tiers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
