arXiv preprint arXiv:2602.10367

Livemedbench: A contamination-free medical benchmark for LLMs with automated rubric evaluation · 2026 · arXiv 2602.10367

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries

cs.CY · 2026-05-26 · unverdicted · novelty 6.0

The MedHarm benchmark shows that aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating that medical safety requires domain-specific testing.

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

cs.CY · 2026-05-24 · unverdicted · novelty 6.0

A scoping review of 134 studies on LLM-as-a-Judge in healthcare finds a concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.

citing papers explorer

Showing 2 of 2 citing papers.

When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries
cs.CY · 2026-05-26 · unverdicted · none · ref 11

LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment
cs.CY · 2026-05-24 · unverdicted · none · ref 163
