Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions: Cross-sectional study

· 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

cs.AI · 2026-03-10 · conditional · novelty 6.0

Repeated sampling of the same safety prompts reveals substantial differences in LLM failure probabilities across temperatures that conventional single-evaluation benchmarks miss.

citing papers explorer

Showing 1 of 1 citing paper.

Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

cs.AI · 2026-03-10 · conditional · none · ref 4
