pith. sign in

arxiv: 2605.25273 · v1 · pith:PPLUOVGOnew · submitted 2026-05-24 · 💻 cs.CY

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

Pith reviewed 2026-06-29 23:16 UTC · model grok-4.3

classification 💻 cs.CY
keywords LLM-as-a-Judgehealthcare evaluationhuman alignmentscoping reviewclinical decision supportmedical NLPprompt engineering
0
0 comments X

The pith

A review of 134 studies maps LLM-as-a-Judge uses in healthcare and finds moderate to strong human alignment that varies by task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a PRISMA-guided scoping review to characterize how LLMs evaluate other AI outputs in healthcare. It codes the 134 eligible studies by application area, judge setup, technical methods, and validation approach. A reader would care because healthcare AI produces unstructured text that is expensive to assess manually at scale, so an automated judge could accelerate testing if it tracks expert opinion. The synthesis shows concentration in four main scenarios, heavy use of OpenAI models and prompt engineering, and alignment that is often but not always reliable.

Core claim

The review establishes that LLM-as-a-Judge appears most often in clinical decision support, clinical NLP, medical knowledge QA, and medical communication. OpenAI models serve as judges in the largest share of studies, nearly every study employs prompt engineering, and ensemble, multi-agent, or retrieval-augmented variants appear as frequent extensions. Among the subset of studies that include human validation, LLM judgments align moderately to strongly with expert judgments, although the degree of agreement differs markedly across tasks.

What carries the argument

The PRISMA-guided scoping review that screens five databases and codes each study on health scenario, judge configuration, technical approach, and validation design.

If this is right

  • LLM-as-a-Judge can serve as a scalable evaluation tool inside clinical decision support and medical QA when human validation is performed.
  • Prompt engineering remains the dominant technique, with ensemble and multi-agent designs as practical extensions.
  • Reliability must be checked per task because alignment strength is not uniform.
  • Clinical adoption requires continued attention to model choice and validation rigor rather than blanket deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the task-dependent alignment pattern persists, developers could prioritize validation effort on high-variability tasks first.
  • The observed dominance of certain models suggests testing whether open-source alternatives achieve comparable agreement in the same healthcare scenarios.
  • The concentration in four areas points to possible gaps in domains such as patient-facing monitoring or surgical planning that future studies could address.

Load-bearing premise

The 134 studies found through the database search form a representative sample of LLM-as-a-Judge work in healthcare and that the coding categories accurately reflect the main methodological differences.

What would settle it

A new literature search that locates many additional studies with markedly different application distributions or substantially lower human-LLM agreement rates would undermine the reported patterns.

Figures

Figures reproduced from arXiv: 2605.25273 by Cathy Shyr, Chen Chen, Deyi Li, Lingyao Li, Lizhou Fan, Mei Liu, Mingquan Lin, Renkai Ma, Rui Yin, Runlong Yu, Siyuan Ma, Steven Bethard.

Figure 1
Figure 1. Figure 1: PRISMA flow diagram illustrating the study filtering process. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of LLM-as-a-Judge frameworks in healthcare. The framework typically includes adaptation strategies, evaluation rubrics and metrics, and common judge biases with corresponding mitigation approaches. 2.4 LLM-as-a-Judge Techniques To describe how LLM-as-a-Judge systems are implemented in healthcare studies, we identify the following major technical strategies used to shape the judge’s behavior, as il… view at source ↗
Figure 3
Figure 3. Figure 3: Publication trend of LLM-as-a-Judge studies in healthcare during the study period. Bars show the [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLM-as-a-Judge research areas in healthcare. Major research domains include Clinical Decision Support, Clinical NLP, Medical Knowledge & QA, and Medical Communication. quality, although fairness and robustness remain relatively underexplored. For example, Hosseini et al. [53] use LLM judges to evaluate long-form medical QA responses for correctness, helpfulness, harmfulness, efficiency, and bias alignment … view at source ↗
Figure 5
Figure 5. Figure 5: Detailed categorization of LLM-as-a-Judge research in healthcare. Included studies are grouped according to major healthcare research domains and their corresponding application tasks. DeepSeek (n=13, 9.7%) (Figure 6a). Most studies employ models from multiple families, consistent with multi-model jury configurations designed to reduce provider-specific biases. Regarding the judge techniques, prompt engine… view at source ↗
Figure 6
Figure 6. Figure 6: Model family and technique distributions for LLM-as-a-Judge in healthcare based on N=134 studies. (a) Number of studies employing each model family, counted at the study level. (b) Distribution of technical approaches for configuring LLM judges. Techniques are not mutually exclusive. Technical approaches. For prompting strategies, rubric-based prompting is predominant, which encodes structured evaluation c… view at source ↗
Figure 7
Figure 7. Figure 7: Specific model versions used as LLM judges. Each subplot shows the top most frequently used models within each major model family: OpenAI (n=90 studies), Google (n=35), Meta (n=25), Anthropic (n=27), Alibaba (n=24), and DeepSeek (n=13). Numbers on each node indicate the count of studies employing that model version; percentages reflect the proportion of total included studies (N=134). “Others” aggregates r… view at source ↗
Figure 8
Figure 8. Figure 8: Agreement rate between LLM-as-a-Judge and human experts Forest plot of 13 examined studies reporting agreement rate with 95% confidence intervals sorted by agreement rate. The dotted line marks the median agreement rate (0.83), and the dashed line marks the mean agreement rate (0.83) across the 13 examined studies. Agreement values are rounded to two decimal places. Chance-corrected agreement (Cohen’s 𝜅). … view at source ↗
Figure 9
Figure 9. Figure 9: Chance-corrected agreement (Cohen’s 𝜅) between LLM-as-a-Judge and human experts. Forest plot of 10 studies reporting Cohen’s 𝜅 with 95% confidence intervals and validation sample sizes (𝑁), sorted by 𝜅. The dotted line marks the median Cohen’s 𝜅 (0.78), and the dashed line marks the mean Cohen’s 𝜅 (0.76) across the 10 examined studies. Cohen’s 𝜅 values are rounded to two decimal places. Score-level correla… view at source ↗
Figure 10
Figure 10. Figure 10: Score-level correlation between LLM-as-a-Judge and human expert ratings. Forest plot of 13 studies reporting Pearson’s 𝑟 (P, 𝑛 = 10) or Spearman’s 𝜌 (S, 𝑛 = 3) with 95% confidence intervals, sorted by correlation. The dotted line marks the median correlation value (0.69), and the dashed line marks the mean correlation value (0.68) across the 13 examined studies. Correlation values are rounded to two decim… view at source ↗
Figure 11
Figure 11. Figure 11: Geographic distribution of LLM-as-a-Judge studies in healthcare. World map showing the number of publications by country, colored by paper count. The United States contributed the largest share (n=66, 48.9%), followed by China (n=19), the United Kingdom (n=14), Canada (n=10), and Germany (n=8). Research contributions spanned 30 countries across six continents. Based on analysis of n=134 included studies. … view at source ↗
Figure 12
Figure 12. Figure 12: Institutional collaboration network for LLM-as-a-Judge research in healthcare. Nodes repre￾sent institutions, sized by the number of contributing studies and colored by country/region. Edges indicate co-authorship on at least one study, with solid lines for intra-community collaborations and dashed lines for inter-community collaborations. Shaded regions denote communities identified by Louvain modularity… view at source ↗
Figure 13
Figure 13. Figure 13: Temporal trends in open-source vs. closed-source LLM judge adoption. Stacked bar chart showing the number of studies per bimonthly period classified by model type: closed-source only (blue), open￾source only (green), and both (gray). Dashed lines show cumulative counts of studies using any closed-source model (blue) and any open-source model (green). Studies using both types are counted in both cumulative… view at source ↗
read the original abstract

Large language models (LLMs) are increasingly deployed across healthcare applications, including clinical documentation, diagnostic reasoning, medicine recommendation, and medical education. Their outputs are largely unstructured clinical text, which is difficult to reliably evaluate at scale. LLM-as-a-Judge, in which an LLM evaluates another system's output against task-specific criteria, offers a scalable alternative and is increasingly used in clinical evaluation, yet its validity in healthcare remains underexamined. Existing reviews focus on general-purpose LLM evaluation or on risk framework, rather than systematically characterizing how LLM-as-a-Judge is applied in healthcare and how well their judgments align with human experts. We therefore conduct a PRISMA-guided comprehensive review of LLM-as-a-Judge applications in healthcare, searching five databases for studies published between January 2023 and February 2026. After screening 541 records, 134 studies meet the eligibility and are coded by health scenario, judge configuration, technical approach, and validation design. LLM-as-a-Judge is concentrated in clinical decision support, clinical natural language processing (NLP), medical knowledge and question answering (QA), and medical communication. OpenAI models are the most frequently used judges, and prompt engineering appears in nearly all studies, with ensemble, multi-agent, and retrieval-augmented designs as common extensions. Among studies reporting human validation, LLM judges often show moderate to strong alignment with expert judgments, although reliability varies substantially across tasks. Overall, this review positions LLM-as-a-Judge as a promising framework for scalable healthcare AI evaluation, while emphasizing that its clinical value depends on model design and rigorous validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a PRISMA-guided scoping review of LLM-as-a-Judge applications in healthcare. It searched five databases for studies from January 2023 to February 2026, screened 541 records, and included 134 studies that were coded on health scenario, judge configuration, technical approach, and validation design. Findings indicate concentration in clinical decision support, clinical NLP, medical QA, and communication; dominance of OpenAI models; near-universal use of prompt engineering with extensions like ensembles and RAG; and, among studies reporting human validation, moderate to strong alignment with expert judgments (with substantial task-to-task variability). The paper positions LLM-as-a-Judge as a promising scalable evaluation framework whose clinical value depends on model design and rigorous validation.

Significance. If the included corpus is representative and the coding reliable, the review supplies a useful field map that documents current practices and the uneven state of human-alignment evidence. This is valuable for guiding future work on scalable clinical AI evaluation. The PRISMA structure and multi-database search are strengths that lend some transparency to the synthesis.

major comments (1)
  1. [Methods] Methods (screening and coding description): The PRISMA process is summarized at a high level (541 records screened to 134 included), but the manuscript does not report the exact search strings, detailed eligibility criteria, or inter-coder reliability statistics for either the initial screening or the subsequent coding of health scenario, judge configuration, and validation design. This is load-bearing for the central claims, because statements about prevalence of applications and the frequency/strength of human alignment inherit any selection or reporting bias present in the 134-study sample.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our scoping review. We address the single major comment below and will revise the manuscript accordingly to improve transparency.

read point-by-point responses
  1. Referee: [Methods] Methods (screening and coding description): The PRISMA process is summarized at a high level (541 records screened to 134 included), but the manuscript does not report the exact search strings, detailed eligibility criteria, or inter-coder reliability statistics for either the initial screening or the subsequent coding of health scenario, judge configuration, and validation design. This is load-bearing for the central claims, because statements about prevalence of applications and the frequency/strength of human alignment inherit any selection or reporting bias present in the 134-study sample.

    Authors: We agree that the current high-level summary of the PRISMA process limits reproducibility and that detailed reporting of search strings, eligibility criteria, and inter-coder reliability is necessary to support the prevalence and alignment claims. In the revised manuscript we will add the exact search strings used in each of the five databases, a complete enumerated list of eligibility criteria, and inter-coder reliability statistics (percentage agreement and Cohen’s kappa) for both the title/abstract screening and the full-text coding phases. revision: yes

Circularity Check

0 steps flagged

No circularity: scoping review with no derivations or predictions

full rationale

This is a PRISMA-guided scoping review that screens 541 records to include 134 studies, codes them by health scenario/judge configuration/validation design, and summarizes descriptive patterns such as concentration in clinical decision support and moderate-to-strong human alignment where reported. No equations, fitted parameters, predictions, or self-citation chains exist that could reduce any claim to its inputs by construction. The work is a literature synthesis whose central statements are empirical summaries of the screened corpus; the representativeness concern raised by the skeptic is a question of external validity and reporting transparency, not circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard systematic review practices and the assumption that the screened literature accurately reflects current LLM-as-a-Judge use in healthcare.

axioms (1)
  • domain assumption PRISMA guidelines for scoping reviews provide a valid and unbiased method for identifying and coding relevant studies
    Invoked to justify screening 541 records down to 134 eligible studies and subsequent coding of scenarios and validation designs

pith-pipeline@v0.9.1-grok · 5860 in / 1352 out tokens · 33152 ms · 2026-06-29T23:16:57.606408+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

178 extracted references · 89 canonical work pages · 15 internal anchors

  1. [1]

    Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R

    Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R. Keet, Minjoung Go, E...

  2. [2]

    Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi, Ghada Bourjeily, Carsten Eickhoff, and Ritambhara Singh. 2026. The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models.arXiv preprint arXiv:2601.05376(2026)

  3. [3]

    Ali Abedi, Charlene H Chu, and Shehroz S Khan. 2026. Evidence-Informed Guidance on Cannabidiol Use in Older Adults: Development and Evaluation of Retrieval-Augmented Large Language Models.arXiv preprint arXiv:2604.09548 (2026)

  4. [4]

    Moaiz Abrar, Yusuf Sermet, and Ibrahim Demir. 2025. An empirical evaluation of large language models on consumer health questions.BioMedInformatics5, 1 (2025), 12

  5. [5]

    Mouath Abu-Daoud, Leen Kharouf, Omar El Hajj, Dana El Samad, Mariam Al-Omari, Jihad Mallat, Khaled Saleh, Nizar Habash, and Farah E Shamout. 2026. MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark.arXiv preprint arXiv:2602.01714(2026)

  6. [6]

    Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, and Tareque Mohmud Chowdhury

  7. [7]

    Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation.arXiv preprint arXiv:2602.14564(2026)

  8. [8]

    Kiran Adnan, Rehan Akbar, Siak Wang Khor, and Adnan Bin Amanat Ali. 2020. Role and challenges of unstructured big data in healthcare.Data management, analytics and innovation(2020), 301–323

  9. [9]

    Lara Ahrens, Wilhelm Haverkamp, and Nils Strodthoff. 2025. ECG-LLM–training and evaluation of domain-specific large language models for electrocardiography.arXiv preprint arXiv:2510.18339(2025)

  10. [10]

    Abdel-Karim Al Tamimi, Jacob Andrews, Jacqueline Benfield, Cath Sweby, Chris Gilmartin, Rebecca Lindley, Diane Trusson, Molly Dziunka, Dee Webster, and Kathryn Radford. 2026. Development and Qualitative Evaluation of R-Speak: Acceptability and Usability of a Smartphone App System Using AI to Enhance Communication in People With Expressive Aphasia. (2026)

  11. [11]

    Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, and Layne C Price. 2025. NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning.arXiv preprint arXiv:2509.23252(2025)

  12. [12]

    Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, and Dominic Pimenta. 2025. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ digital medicine8, 1 (2025), 274

  13. [13]

    Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Xiangji Huang, Frank Rudzicz, and Elham Dolatabadi. 2026. When can we trust llms in mental LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment 25 health? large-scale benchmarks for reliable llm...

  14. [14]

    Omid Bazgir, Vineeth Manthapuri, Ilia Rattsev, and Mohammad Jafarnejad. 2025. GRASP: Graph Reasoning Agents for Systems Pharmacology with Human-in-the-Loop.arXiv preprint arXiv:2512.05502(2025)

  15. [15]

    Kate H Bentley, Luca Belli, Adam M Chekroud, Emily J Ward, Emily R Dworkin, Emily Van Ark, Kelly M Johnston, Will Alexander, Millard Brown, and Matt Hawrilenko. 2026. VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health.arXiv preprint arXiv:2602.05088(2026)

  16. [16]

    Lungren, and Thomas Osborne

    Matthias Blondeel, Noel Codella, Sam Preston, Hao Qiu, Leonardo Schettini, Frank Tuan, Wen-wai Yim, Smitha Saligrama, Mert Öz, Shrey Jain, Matthew P. Lungren, and Thomas Osborne. 2025. Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards.arXiv preprint arXiv:2509.06602(2025)

  17. [17]

    Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, and Iacer Calixto. 2025. DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries.arXiv preprint arXiv:2506.16777 (2025)

  18. [18]

    Marco Bolpagni, Simone De Carli, Leonardo Sanna, Mauro Dragoni, and Silvia Gabrielli. 2025. VALISE: A Virtual Agent Laboratory for Instruction-Following Simulation and Evaluation of LLM-Powered Digital Health Interventions. InFrontiers in Artificial Intelligence and Applications. Vol. 413. 5096–5099

  19. [19]

    Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, and Mingming Fan. 2025. Exploring safety alignment evaluation of llms in chinese mental health dialogues via llm-as-judge.arXiv preprint arXiv:2508.08236(2025)

  20. [20]

    Gonzalo Cardenal-Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, and Diane Duroux. 2025. HIVMedQA: Benchmarking large language models for HIV medical decision support.arXiv preprint arXiv:2507.18143 (2025)

  21. [21]

    Edward Y Chang and Ethan Y Chang. 2025. Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning.arXiv preprint arXiv:2510.04488(2025)

  22. [22]

    Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, and Dakuo Wang. 2025. Multi-agent-as-judge: Aligning LLM-agent-based automated evaluation with multi-dimensional human evaluation.arXiv preprint arXiv:2507.21028(2025)

  23. [23]

    Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin F...

  24. [24]

    Yuhao Chen, Bo Wen, and Farhana Zulkernine. 2025. A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.JMIR AI4 (2025), e75932

  25. [25]

    Zhihui Chen and Mengling Feng. 2025. Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing.arXiv preprint arXiv:2511.00801(2025)

  26. [26]

    Miller, Majid Afshar, and Yanjun Gao

    He Cheng, Yifu Wu, Saksham Khatwani, Maya Kruse, Dmitriy Dligach, Timothy A. Miller, Majid Afshar, and Yanjun Gao. 2026. Scaling Biomedical Knowledge Graph Retrieval for Interpretable Reasoning: Applications to Clinical Diagnosis Prediction.medRxiv(2026). doi:10.64898/2026.01.12.26343957

  27. [27]

    Cheng-Han Chiang, Hung-yi Lee, and Michal Lukasik. 2025. Tract: Regression-aware fine-tuning meets chain-of- thought reasoning for llm-as-a-judge. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2934–2952

  28. [28]

    Goodell, Yeasul Kim, S

    Philip Chung, Akshay Swaminathan, Alex J. Goodell, Yeasul Kim, S. Momsen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, David Seong, Andrew A. Lee, Caitlin E. Coombes, Brad Bradshaw, Mahir A. Sufian, Hyo Jung Hong, Teresa P. Nguyen, Mohammad R. Rasouli, Komal Kamra, Mark A. Burbridge, James C. McAvoy, Roya Saffary...

  29. [29]

    Ionut, Croitoru, Cristina Elena Turcu, and Corneliu Octavian Turcu. 2026. Privacy-by-Design in AI-Assisted Systems for Caregivers of Children with Autism: A Secure Multi-Agent Architecture.Applied Sciences16, 4 (2026), 2157

  30. [30]

    Churpek, Anoop Mayampurath, Frank Liao, Cherodeep Goswami, Karen K

    Emma Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, John Caskey, Madeline Oguss, Graham Wills, Guanhua Chen, Dmitriy Dligach, Matthew M. Churpek, Anoop Mayampurath, Frank Liao, Cherodeep Goswami, Karen K. Wong, Brian W. Patterson, and Majid Afshar. 2025. Evaluating clinical AI summaries with large language models as judges.npj D...

  31. [31]

    Corey Curran, Nafis Neehal, Keerthiram Murugesan, and Kristin P Bennett. 2024. Examining trustworthiness of llm-as-a-judge systems in a clinical trial design benchmark. In2024 IEEE International Conference on Big Data (BigData). 26 Li et al. IEEE, 4627–4631

  32. [32]

    Hong-Jie Dai, Zheng-Hao Li, An-Tai Lu, Bo-Tsz Shain, Ming-Ta Li, Tatheer Hussain Mir, Kuang-Te Wang, Min-I Su, Pei-Kang Liu, and Ming-Ju Tsai. 2025. Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning.arXiv preprint arXiv:2509.18846 (2025)

  33. [33]

    Amanda Dawson and Sergei Ananyan. 2017. The Role of Unstructured Data in Healthcare Analytics. InActionable Intelligence in Healthcare. Auerbach Publications, 241–262

  34. [34]

    Iker De la Iglesia, Iakes Goenaga, Johanna Ramirez-Romero, Jose Maria Villa-Gonzalez, Josu Goikoetxea, and Ander Barrena. 2025. Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments. InProceedings of the 31st International Conference on Computational Linguistics. 9456–9471

  35. [35]

    Fajardo V Diego, Oleksii Proniakin, Victoria-Elisabeth Gruber, and Razvan Marinescu. 2026. MedPI: Evaluating AI Systems in Medical Patient-facing Interactions.medRxiv(2026), 2025–12

  36. [36]

    Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, and Ying Ding. 2025. Demo: Guide- RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID.arXiv preprint arXiv:2510.15782(2025)

  37. [37]

    Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Wenrao Pang, Ruiyao Chen, Xinwei Peng, Renjie Lu, Sijie Ren, Guanxu Zhu, Xiaoqin Wu, Zhiqiang Liu, Rongzhao Zhang, Luyi Jiang, Bing Han, Yunqiu Wang, and Jie Xu. 2025. MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Age...

  38. [38]

    Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie Vera He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, and Ernest Lim. 2026. Wer is unaware: Assessing how asr errors distort clinical understanding in patient facing dialogue. InProceedings of the 16th International Workshop on Spoken Dialogue System Technology. 391–417

  39. [39]

    Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, and Maksym Andriushchenko. 2026. HalluHard: A Hard Multi-Turn Hallucination Benchmark.arXiv preprint arXiv:2602.01031(2026)

  40. [40]

    Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, and Lequan Yu. 2026. PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 30717–30725

  41. [41]

    Rahatara Ferdousi and M Anwar Hossain. 2025. RHealthTwin: Towards Responsible and Multimodal Digital Twins for Personalized Well-being.arXiv preprint arXiv:2506.08486(2025)

  42. [42]

    Aya E Fouda, Abdelrahman A Hassan, Radwa J Hanafy, and Mohammed E Fouda. 2026. PsychiatryBench: a multi-task benchmark for LLMs in psychiatry.npj Digital Medicine(2026)

  43. [43]

    Yasuko Fukataki, Wakako Hayashi, Megumi Kitayama, and Yoichi M Ito. 2026. Measurement of retrieved chunk quality from real-world knowledge in retrieval-augmented generation: A Phase 1 foundational study.medRxiv(2026), 2026–01

  44. [44]

    Chunjing Gan, Dan Yang, Binbin Hu, Ziqi Liu, Yue Shen, Zhiqiang Zhang, Jian Wang, and Jun Zhou. 2025. POLYRAG: Integrating Polyviews into Retrieval-Augmented Generation for Medical Applications.arXiv preprint arXiv:2504.14917 (2025)

  45. [45]

    Cool, Zahir Kanjee, Andrew S

    Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Joséphine A. Cool, Zahir Kanjee, Andrew S. Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P. J. Olson, Adam Rodman, and Jonathan H. Chen. 2024. Large language model influence on diagnostic reasoning: a randomized clinical trial.JAMA network open7, 10 (2...

  46. [46]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Zhouchi Lin, Bowen Zhang, Lionel Ni, Wen Gao, Yuanzhuo Wang, and Jian Guo. 2024. A survey on llm-as-a-judge.The Innovation(2024)

  47. [47]

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. 2025. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746(2025)

  48. [48]

    Jaewook Han, Beomgi So, Whanbo Jung, Minwoo Kim, Hoon Kim, and Daun Shin. 2026. Optimizing small lo- cal language models for culturally competent mental health counseling: comparative evaluation with GPT-4o by psychiatrists.JMIR Preprints(2026). doi:10.2196/preprints.92470

  49. [49]

    Ryan Han, Julián N Acosta, Zahra Shakeri, John PA Ioannidis, Eric J Topol, and Pranav Rajpurkar. 2024. Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review.The lancet digital health6, 5 (2024), e367–e373

  50. [50]

    HM Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, and Randy Goebel. 2025. Reason2Decide: Rationale- Driven Multi-Task Learning.arXiv preprint arXiv:2512.20074(2025)

  51. [51]

    Qing He, Dongsheng Bi, Jianrong Lu, Minghui Yang, Zixiao Chen, Jiacheng Lu, Jing Chen, Nannan Du, Xiao Cu, Sijing Wu, Peng Xiang, Yinyin Hu, Yi Guo, Chunpu Li, Shaoyang Li, Zhuo Dong, Ming Jiang, Shuai Guo, Liyun Feng, LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment 27 Jin Peng, Jian Wang, Jinjie Gu, and Junw...

  52. [52]

    Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy Shavor, Lisbeth Garcia Arguello, Patrick Murray, and Zhiyong Lu. 2024. Quality of answers of generative large language models versus peer users for interpreting laboratory test results for lay patients: evaluation study.Journal of medical Internet research26 (2024), e56655

  53. [53]

    Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, and Eiji Aramaki. 2025. Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system.arXiv preprint arXiv:2509.17444(2025)

  54. [54]

    Pedram Hosseini, Jessica M Sin, Bing Ren, Bryceton G Thomas, Elnaz Nouri, Ali Farahanchi, and Saeed Hassanpour

  55. [55]

    A benchmark for long-form medical question answering.arXiv preprint arXiv:2411.09834(2024)

  56. [56]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models.Iclr1, 2 (2022), 3

  57. [57]

    Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, and Bryan Hooi. 2026. Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs.arXiv preprint arXiv:2601.08763(2026)

  58. [58]

    Dongsuk Jang, Ziyao Shangguan, Kyle Tegtmeyer, Anurag Gupta, Jan T Czerminski, Sophie Chheang, and Arman Cohan. 2025. MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 319–353

  59. [59]

    Kennedy, Sebastian Lobentanzer, and Georg Fuellen

    Hans Jarchow, Christoph Bobrowski, Steffi Falk, Andreas Hermann, Anton Kulaga, Johann-Christian Põder, Maxim- ilian Unfried, Nikolay Usanov, Bijan Zendeh, Brian K. Kennedy, Sebastian Lobentanzer, and Georg Fuellen. 2025. Benchmarking large language models for personalized, biomarker-based health intervention recommendations.NPJ Digital Medicine8, 1 (2025), 631

  60. [60]

    Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, and Alan Schelten. 2025. Compute as teacher: Turning inference compute into reference-free supervision.arXiv preprint arXiv:2509.14234(2025)

  61. [61]

    Seyeon Jeong, Yeonjun Choi, JongWook Kim, and Beakcheol Jang. 2026. Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval.arXiv preprint arXiv:2601.04742(2026)

  62. [62]

    Md Abdullah Al Kafi, Raka Moni, and Sumit Kumar Banshal. 2026. Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems.arXiv preprint arXiv:2601.01341(2026)

  63. [63]

    Tomiris Kaumenova, Subhankar Chakraborty, Eric Fosler-Lussier, Kebire Gofar, Isaiah Metcalf, Andrew Perrault, and Michael White. 2025. Evaluating Large Language Models for Colonoscopy Preparation Assistance: Correctness and Diversity in Synthetic Dialogues.medRxiv(2025), 2025–11

  64. [64]

    Saksham Khatwani, He Cheng, Majid Afshar, Dmitriy Dligach, and Yanjun Gao. 2025. Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning.arXiv preprint arXiv:2509.18316(2025)

  65. [65]

    Edward Kim, Richard Foty, Manil Shrestha, and Vicki Seyfert-Margolis. 2025. Conformal prediction and verification of large language model extractions in EHR data. InProceedings of the AAAI Symposium Series, Vol. 7. 539–546

  66. [66]

    Jiwon Kim, Violeta J Rodriguez, Dong Whi Yoo, Eshwar Chandrasekharan, and Koustuv Saha. 2026. PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support.arXiv preprint arXiv:2601.12754(2026)

  67. [67]

    Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Chunjong Park, Hyeonhoon Lee, Hae Won Park, Daniel McDuff, Samir Tulebaev, and ...

  68. [68]

    Masamune Kobayashi, Masato Mita, and Mamoru Komachi. 2024. Large language models are state-of-the-art evaluator for grammatical error correction. InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 68–77

  69. [69]

    Veysel Kocaman, Mustafa Aytuğ Kaya, Andrei Marian Feier, and David Talby. 2025. Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation.JMIR AI4, 1 (2025), e72153

  70. [70]

    Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwanghyun Kim, Jeewon Yang, Se- unghyun Won, and Edward Choi. 2024. Ehrnoteqa: An llm benchmark for real-world clinical practice using discharge summaries.Advances in Neural Information Processing Systems37 (2024), 124575–124611

  71. [71]

    Jethro CC Kwong, Lauren Erdman, Adree Khondker, Marta Skreta, Anna Goldenberg, Melissa D McCradden, Ar- mando J Lorenzo, and Mandy Rickard. 2022. The silent trial-the bridge between bench-to-bedside clinical AI applications.Frontiers in digital health4 (2022), 929508

  72. [72]

    Himanshi Lalwani and Hanan Salam. 2026. The Supportiveness-Safety Tradeoff in LLM Well-Being Agents.arXiv preprint arXiv:2602.04487(2026). 28 Li et al

  73. [73]

    Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, and Jimmy Huang. 2025. Improving automatic evaluation of large language models (LLMs) in biomedical relation extraction via LLMs-as- the-judge. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 25483–25497

  74. [74]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

  75. [75]

    A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework

    Chenyu Li, Zohaib Akhtar, Mingu Kwak, Yuelyu Ji, Hang Zhang, Tracey Obi, Yufan Ren, Xizhi Wu, Sonish Sivarajku- mar, Harold P. Lehmann, Shyam Visweswaran, Michael J. Becich, Danielle L. Mowery, Renxuan Liu, Haoyang Sun, and Yanshan Wang. 2026. A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework.arXiv preprint arXiv:2604.25933(2026)...

  76. [76]

    Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025. From generation to judgment: Opportunities and challenges of llm-as-a-judge. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2757–2791

  77. [77]

    Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, and Yiqun Liu. 2025. Calibraeval: Calibrating prediction distribution to mitigate selection bias in llms-as-judges. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16537–16552

  78. [78]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. Llms-as-judges: a comprehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579(2024)

  79. [79]

    Hanyang Li, Xiao He, Adarsh Subbaswamy, Patrick Vossler, Alexej Gossmann, Karandeep Singh, and Jean Feng. 2026. Scaling medical device regulatory science using large language models.npj Digital Medicine(2026)

  80. [80]

    Kunning Li, Jianbin Guo, Zhaoyang Shang, Yiqing Liu, Hongmin Du, Lingling Liu, Yuping Zhao, and Lifeng Dong

Showing first 80 references.