Reliability Auditing for Downstream LLM Tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Pith reviewed 2026-05-09 22:03 UTC · model grok-4.3
The pith
Including medically insignificant features raises the average hospitalization risk predicted by LLMs and increases output variability across models and prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By auditing four LLMs on synthetic psychiatric patient profiles containing 15 clinically relevant features and up to 50 insignificant ones across four prompt reframings, the study establishes that the inclusion of medically insignificant variables produces a statistically significant rise in both the absolute mean predicted hospitalization risk and the variability of model outputs.
What carries the argument
A structured reliability audit that measures the effect of prompt design and the addition of medically insignificant inputs on the stability of predicted hospitalization risk scores.
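The audit structure described above (four models, four prompt framings, increasing counts of insignificant features) can be sketched as a simple evaluation loop. This is an illustration, not the paper's code: the `query_model` helper is hypothetical, and the noise levels shown are assumed increments rather than the study's actual ones.

```python
import itertools
import statistics

# Illustrative audit loop: for each model, prompt framing, and noise level,
# query the LLM on every profile and record the mean and spread of the
# predicted hospitalization risk scores.
MODELS = ["gemini-2.5-flash", "llama-3.3-70b", "claude-sonnet-4.6", "gpt-4o-mini"]
PROMPTS = ["neutral", "logical", "human_impact", "clinical_judgment"]
NOISE_LEVELS = [0, 10, 25, 50]  # counts of insignificant features (assumed)

def audit(profiles, query_model):
    """query_model(model, prompt, profile, n_noise) -> predicted risk score."""
    results = {}
    for model, prompt, n_noise in itertools.product(MODELS, PROMPTS, NOISE_LEVELS):
        scores = [query_model(model, prompt, p, n_noise) for p in profiles]
        results[(model, prompt, n_noise)] = {
            "mean": statistics.mean(scores),   # central tendency per condition
            "stdev": statistics.stdev(scores), # instability proxy per condition
        }
    return results
```

Rising `stdev` (and mean) as `n_noise` grows, across many `(model, prompt)` cells, is the pattern the study reports.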
If this is right
- The addition of insignificant features increases instability across many model-prompt combinations.
- Different prompt variations change the pattern of instability depending on the specific model.
- LLM-based psychiatric risk assessments show clear sensitivity to non-clinical contextual information.
- Systematic evaluations of attributional stability and uncertainty are required prior to any clinical use.
Where Pith is reading between the lines
- Similar audits could be applied to other clinical tasks like diagnosis or treatment recommendation to check for the same issues.
- If confirmed, this might imply that LLMs require input sanitization or explicit uncertainty reporting when used in medicine.
- The findings could extend to real-world data, where patient records often contain extraneous details that might influence model behavior.
Load-bearing premise
The definition of clinically insignificant features is taken as objective, and the synthetic profiles are assumed to reflect how real psychiatric cases would respond to such additions.
What would settle it
A replication using actual psychiatric patient records instead of synthetic ones, where adding the same insignificant features produces no significant change in risk scores or variability, would falsify the central finding.
Original abstract
Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a reliability auditing framework for LLM-based psychiatric risk assessment by measuring how predicted hospitalization risk scores change when up to 50 author-designated 'clinically insignificant' features are added to 50 synthetic patient profiles (each with 15 core features). Four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini) are tested under four prompt framings; the central empirical claim is that adding the insignificant features produces statistically significant rises in absolute mean risk and output variability across all model-prompt combinations, demonstrating reduced stability to non-clinical context.
Significance. If substantiated, the work is significant for AI safety in clinical NLP because it supplies a concrete, replicable audit protocol that quantifies sensitivity to extraneous inputs in a high-stakes domain. The multi-model, multi-prompt design and focus on downstream risk scoring rather than abstract benchmarks are strengths; the purely empirical approach avoids circular derivations and offers falsifiable predictions about instability growth with context length.
major comments (3)
- [Abstract] The headline result ('statistically significant increase in the absolute mean predicted hospitalization risk and output variability') is reported without naming the test statistic, effect-size measure, exact p-values, or any multiple-comparison correction across the 16 model-prompt conditions. With n=50 synthetic cases, these omissions prevent assessment of whether the claimed significance is robust or an artifact of small-sample or uncorrected testing.
- [Abstract / Methods (implied)] The classification of features as 'clinically insignificant' is presented as author judgment with no inter-rater protocol, no citations to psychiatric literature, and no empirical check (e.g., correlation with real-world risk labels). Because the entire instability claim rests on these features carrying zero medical signal, the lack of external grounding is load-bearing and directly matches the stress-test concern.
- [Abstract] The synthetic-profile generator is described only at the level of 'n=50' and '15 clinically relevant + up to 50 insignificant features'; no validation is supplied that the generated cases preserve the conditional dependencies present in real psychiatric records. Any coupling between the added variables and the core 15 features could produce the observed risk shifts without implying LLM instability.
minor comments (2)
- [Abstract] The abstract does not state the exact increments or total counts of insignificant features tested (e.g., 0, 10, 25, 50), which obscures whether the instability effect is monotonic or threshold-based.
- Prompt templates and exact feature lists are not reproduced; providing them (or a repository link) would improve reproducibility of the audit protocol.
Simulated Author's Rebuttal
We are grateful to the referee for providing a thorough review that identifies key areas where our manuscript can be improved for clarity and scientific rigor. We have prepared point-by-point responses to the major comments and will incorporate revisions to address the concerns raised regarding statistical reporting, feature justification, and synthetic data validation.
Point-by-point responses
Referee: [Abstract] The headline result ('statistically significant increase in the absolute mean predicted hospitalization risk and output variability') is reported without naming the test statistic, effect-size measure, exact p-values, or any multiple-comparison correction across the 16 model-prompt conditions. With n=50 synthetic cases, these omissions prevent assessment of whether the claimed significance is robust or an artifact of small-sample or uncorrected testing.
Authors: We agree that the abstract would benefit from explicit statistical details to allow readers to evaluate the robustness of the findings. In the revised manuscript we will expand the abstract to name the tests (paired t-tests for changes in mean risk scores and Levene's test for changes in output variability), report effect sizes (Cohen's d), provide the range of exact p-values, and state that Bonferroni correction was applied across the 16 model-prompt conditions. These details are already computed and presented in the Results section; the revision will simply summarize them concisely in the abstract. revision: yes
Referee: [Abstract / Methods (implied)] The classification of features as 'clinically insignificant' is presented as author judgment with no inter-rater protocol, no citations to psychiatric literature, and no empirical check (e.g., correlation with real-world risk labels). Because the entire instability claim rests on these features carrying zero medical signal, the lack of external grounding is load-bearing and directly matches the stress-test concern.
Authors: We acknowledge that the initial presentation relied on author expertise without formal external validation. We will revise the Methods section to include citations to psychiatric literature on established risk factors for hospitalization (e.g., studies identifying symptom severity, prior admissions, and substance use as primary predictors while excluding unrelated personal attributes). The complete list of the 50 insignificant features will be provided in an appendix. We will also add an explicit limitation statement noting the absence of a multi-clinician inter-rater protocol. Because the synthetic risk labels were assigned solely from the 15 core features, no post-hoc correlation check between insignificant features and risk labels is strictly necessary: the generation process enforces independence by construction. revision: partial
Referee: [Abstract] The synthetic-profile generator is described only at the level of 'n=50' and '15 clinically relevant + up to 50 insignificant features'; no validation is supplied that the generated cases preserve the conditional dependencies present in real psychiatric records. Any coupling between the added variables and the core 15 features could produce the observed risk shifts without implying LLM instability.
Authors: We agree that explicit validation of feature independence is required to rule out data-generation artifacts. The generator samples the 15 core features from joint distributions derived from psychiatric literature (preserving realistic conditional dependencies among them) while drawing the insignificant features from independent marginal distributions with no dependence on core features or assigned risk. In the revised Methods we will add a full description of the sampling procedure together with a validation table demonstrating zero Pearson and Spearman correlations between the added features and both the core features and the synthetic risk labels. This documentation will confirm that the observed LLM output changes cannot be attributed to unintended coupling in the synthetic profiles. revision: yes
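The independence-by-construction scheme and the promised correlation validation can be sketched as below. The distributions are assumptions chosen for illustration (a shared latent severity factor for core features, independent Gaussians for insignificant ones), not the authors' actual generator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def generate_profiles(n=50, n_core=15, n_noise=50):
    """Core features share a latent severity factor (realistic dependencies
    among them); insignificant features are independent marginals; risk
    labels are derived from the core features only."""
    latent = rng.normal(size=(n, 1))                         # shared factor
    core = latent + rng.normal(scale=0.5, size=(n, n_core))  # correlated block
    noise = rng.normal(size=(n, n_noise))                    # independent block
    risk = core.mean(axis=1)                                 # labels from core only
    return core, noise, risk

def validate_independence(noise, risk, alpha=0.05):
    """Flag insignificant features whose Pearson or Spearman correlation with
    the risk label is nominally significant; chance level is expected."""
    flagged = []
    for j in range(noise.shape[1]):
        _, p_pearson = stats.pearsonr(noise[:, j], risk)
        _, p_spearman = stats.spearmanr(noise[:, j], risk)
        if min(p_pearson, p_spearman) < alpha:
            flagged.append(j)  # false positives at roughly the alpha rate
    return flagged
```

A validation table like the one the authors promise would report these correlations per feature; only chance-level nominal hits should appear.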
Circularity Check
No circularity: purely empirical audit with no derivations or self-referential reductions
full rationale
The manuscript describes an empirical reliability audit: synthetic profiles (n=50) are generated with 15 core features plus up to 50 author-labeled insignificant features, then fed to four LLMs under four prompt variants; hospitalization-risk outputs are measured for changes in mean and variance. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear. The sole potential point of contention, the authors' designation of features as "clinically insignificant," is an explicit experimental input rather than a result derived from the data or from prior self-citations. All reported statistical increases are direct observations of model behavior, not reductions to the classification itself. The audit is therefore empirically grounded rather than self-referential, and receives a circularity score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic patient profiles with author-defined clinically relevant and insignificant features can be used to evaluate LLM stability in psychiatric risk assessment.