Physician oversight reveals high error rates in LLM-generated labels for a clinical benchmark and demonstrates that corrected labels improve both evaluation accuracy and downstream model training.
Brazelton, Margaret A
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.AI 2representative citing papers
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
citing papers explorer
-
Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
Physician oversight reveals high error rates in LLM-generated labels for a clinical benchmark and demonstrates that corrected labels improve both evaluation accuracy and downstream model training.
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.