Comparing the accuracy of large language models and prompt engineering in diagnosing real-world cases
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
  Adding medically insignificant features to prompts causes statistically significant increases in mean predicted hospitalization risk and output variability across four LLMs and four prompt styles on synthetic patient profiles.
- Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
  Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.