CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.
npj Digital Medicine , year=
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning
CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.