Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Adding medically insignificant features to prompts causes statistically significant increases in mean predicted hospitalization risk and output variability across four LLMs and four prompt styles on synthetic patient profiles.