Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

Akanksha Dadlani; Darja Djordjevic; Duncan Eddy; Eugenia Kim; Kiana Jafari; Max Lamparth; Mykel Kochenderfer; Nina Vasan; Paul Ulrich Nikolaus Rust; Robbie Fraser

arxiv: 2601.18061 · v3 · submitted 2026-01-26 · 💻 cs.AI · cs.HC



Pith reviewed 2026-05-16 11:43 UTC · model grok-4.3


keywords mental health AI · expert evaluation · human feedback · inter-rater reliability · AI safety · LLM responses · clinical disagreement · suicide risk assessment


The pith

Aggregated expert judgments in mental health AI safety testing erase distinct clinical philosophies and yield unreliable ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the core assumption in learning from human feedback that averaging expert ratings produces valid training and evaluation data for AI. Three psychiatrists applied a shared rubric to LLM-generated mental health responses and showed consistently low inter-rater reliability, with the worst scores on suicide and self-harm items. Qualitative interviews showed that the disagreement arose from coherent but incompatible individual clinical orientations rather than random error or poor training. Because these orientations reflect real professional judgment, simple averaging creates compromise labels that discard the underlying reasoning. In safety-critical domains this undermines the use of consensus-based human feedback for reward models and benchmarks.

Core claim

Aggregated expert labels function as arithmetic compromises that effectively erase grounded professional philosophies. Expert disagreement in safety-critical AI is a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence rather than measurement error.
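The arithmetic behind this claim can be illustrated with a minimal Python sketch. The scores, the 1-5 scale, and the helper name `summarize` are invented for illustration and do not come from the paper; only the three orientation names are taken from the abstract.

```python
from statistics import mean

# Hypothetical 1-5 safety scores for two LLM responses (invented numbers).
# The keys reuse the three orientation names from the abstract.
response_a = {"safety_first": 3, "engagement_centered": 3, "culturally_informed": 3}
response_b = {"safety_first": 1, "engagement_centered": 5, "culturally_informed": 3}

def summarize(scores):
    """Return the consensus-style mean next to the spread that the mean hides."""
    values = list(scores.values())
    return {"mean": mean(values), "range": max(values) - min(values)}

summary_a = summarize(response_a)
summary_b = summarize(response_b)
print(summary_a)
print(summary_b)
```

Both responses end up with the same averaged label, although the raters agree on the first and split widely on the second. A pipeline that stores only the mean cannot tell the two cases apart.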

What carries the argument

Inter-rater reliability statistics (ICC and Krippendorff's alpha) paired with qualitative analysis of psychiatrist interviews that map responses to distinct clinical frameworks such as safety-first, engagement-centered, and culturally-informed orientations.
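For readers unfamiliar with the first statistic, here is a hedged sketch of one common ICC variant. The abstract does not say which ICC form the authors used, so ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption here, and the rating matrix is hypothetical.

```python
from statistics import mean

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated item; each row holds one
    score per rater, with no missing values.
    """
    n = len(ratings)      # number of rated items
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: 5 responses (rows) rated by 3 raters (columns).
toy_ratings = [
    [1, 4, 2],
    [2, 5, 2],
    [4, 2, 3],
    [3, 3, 5],
    [5, 1, 4],
]
print(round(icc2_1(toy_ratings), 3))
```

Values near 1 mean the raters order and score the items alike; values near or below 0 mean that knowing one rater's score says little about another's.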

If this is right

Reward modeling for mental health AI must treat expert disagreement as structured signal rather than noise to be averaged away.
Safety classification benchmarks should move away from single consensus labels toward representations that keep separate expert frameworks visible.
Evaluation protocols need methods that learn from multiple professional heuristics instead of forcing arithmetic agreement.
Training pipelines that preserve and model individual expert philosophies could produce AI systems better aligned with real clinical practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured divergence may appear in other high-stakes expert domains such as legal review or medical diagnosis AI.
Alignment techniques could be developed to train separate models on each coherent expert framework rather than a single averaged dataset.
Practitioners might test whether AI performance on safety tasks improves when models are exposed to disaggregated expert labels during training.

Load-bearing premise

The disagreement patterns seen with these three psychiatrists and this rubric would appear similarly with other experts and different evaluation instruments.

What would settle it

A replication study with a larger panel of psychiatrists evaluating comparable LLM responses that reports high inter-rater reliability (ICC above 0.7) across safety-critical items would falsify the central claim.

Original abstract

Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's α = -0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally-informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
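To make the notion of negative reliability concrete, the sketch below computes Krippendorff's alpha for complete data with the interval (squared-difference) metric. The choice of metric is an assumption, since the abstract does not state which one the authors used, and the ratings are invented.

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    `units` is a list of lists; each inner list holds the scores that all
    raters gave to one item. Items with fewer than two scores are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled scores.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all scores are identical; alpha is undefined")
    return 1.0 - d_o / d_e

# Hypothetical case: two raters who systematically take opposite positions.
opposed = [[1, 5], [5, 1], [1, 5], [5, 1]]
print(krippendorff_alpha_interval(opposed))
```

In this toy case the raters differ more within each item than two randomly paired scores would, so alpha comes out at about -0.75. That is the pattern the abstract describes as disagreement worse than chance.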

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows structured disagreement among three psychiatrists on mental health AI safety items, traced to distinct clinical frameworks, but the tiny rater pool leaves the reliability stats shaky.


The main point is that three psychiatrists rating LLM responses on a mental health safety rubric produced low agreement, with ICCs from 0.087 to 0.295 and negative Krippendorff alpha on one factor, and the authors link this to coherent differences in approach rather than random noise. They identify safety-first, engagement-centered, and culturally-informed orientations from follow-up interviews, with the biggest splits on suicide and self-harm items. This is new in the AI safety literature, where we often treat expert labels as straightforward ground truth for LHF pipelines.

The paper does well by moving past raw reliability numbers to interpret the source of divergence and by flagging concrete consequences for reward modeling and benchmarks in high-stakes domains. The qualitative piece adds useful texture that pure stats would miss.

The soft spot is the scale. With only three raters and no reported count of responses or items in the abstract, those coefficients carry large uncertainty and could shift with one or two outliers or rubric tweaks. The jump from this sample to the broader claim that aggregation erases grounded philosophies needs more data to land solidly.

Readers working on safety evaluation for clinical or sensitive AI applications will find the numbers and framework discussion worth considering as a caution. It deserves peer review because the question it raises about expert consensus is real and timely, even if the current evidence is preliminary and would benefit from larger samples and fuller methods reporting.

Referee Report

3 major / 2 minor

Summary. The manuscript examines the limits of aggregated human feedback in mental health AI safety testing by having three psychiatrists evaluate LLM-generated responses using a calibrated rubric. It reports poor inter-rater reliability (ICC ranging from 0.087 to 0.295, with one factor showing Krippendorff's α = -0.203) and uses qualitative interviews to argue that disagreements stem from coherent but incompatible clinical frameworks (safety-first, engagement-centered, culturally-informed), rather than random error. The central claim is that aggregated labels erase grounded professional philosophies, and expert disagreement in safety-critical AI should be treated as a sociotechnical phenomenon requiring alignment methods that preserve divergence.

Significance. If the results are robust, this work is significant for challenging core assumptions in RLHF and LHF for high-stakes domains. It provides empirical evidence that expert consensus may not be appropriate for mental health safety evaluation, with implications for reward modeling, safety classification, and benchmarks. The combination of quantitative reliability metrics and qualitative insights into frameworks is a strength, highlighting the need for methods that learn from disagreement.

major comments (3)

[Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.
[Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.
[Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.

minor comments (2)

[Abstract] The range for ICC is given as 0.087–0.295, but it is unclear which specific factors or items correspond to the lower and upper bounds; specifying this would improve clarity.
[Introduction] The paper could benefit from citing more prior work on inter-rater reliability in clinical psychology or AI safety evaluations to contextualize the findings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and robustness of our manuscript. We address each of the major comments below.


Referee: [Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.

Authors: We acknowledge the limitation of having only three raters, which does introduce potential for high sampling variability in the reliability metrics. However, the fact that we observed consistently low ICC values across different factors and a negative alpha on one factor points to systematic disagreement. To strengthen this, we have added bootstrap resampling to estimate confidence intervals for the ICC and alpha values in the revised Methods and Results sections. We also examined the data for outliers and confirmed that the negative alpha persists even after sensitivity checks.

revision: partial

Referee: [Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.

Authors: We have revised the manuscript to include the total number of LLM responses evaluated and the number of rubric items per category in the Methods section. This allows for assessment of statistical power and generalizability of the disagreement patterns.

revision: yes

Referee: [Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.

Authors: The qualitative component was intended to provide interpretive depth to the quantitative reliability findings. We have expanded the Discussion to include additional quotes from the interviews that directly map each psychiatrist's framework to their specific rating behaviors on the rubric items. While the small number of raters precludes formal quantitative correlation analyses, the alignment between self-described frameworks and observed rating patterns is evident in the data. We have added a supplementary table illustrating this mapping.

revision: partial
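The bootstrap resampling mentioned in the first response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: it resamples rated items (not raters) with replacement, uses a percentile interval, and uses the one-way ICC(1,1) as a stand-in statistic. The function names and the rating matrix are hypothetical.

```python
import random
from statistics import mean

def icc1_1(ratings):
    """One-way random-effects ICC(1,1); rows are items, columns are raters."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    msb = k * sum((mean(row) - grand) ** 2 for row in ratings) / (n - 1)
    msw = sum((x - mean(row)) ** 2 for row in ratings for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def bootstrap_ci(ratings, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic`, resampling items."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        try:
            estimates.append(statistic(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no variance; skip it
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lo_idx = int((1 - level) / 2 * len(estimates))
    hi_idx = max(lo_idx, int((1 + level) / 2 * len(estimates)) - 1)
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical scores: 8 responses (rows) rated by 3 raters (columns).
toy_ratings = [
    [1, 4, 2], [2, 5, 2], [4, 2, 3], [3, 3, 5],
    [5, 1, 4], [2, 2, 3], [4, 5, 1], [3, 1, 2],
]
low, high = bootstrap_ci(toy_ratings, icc1_1)
print(round(icc1_1(toy_ratings), 3), (round(low, 3), round(high, 3)))
```

Resampling items addresses uncertainty from the set of rated responses only. With three raters, it cannot quantify how much the estimates would change under a different panel, which is the referee's main concern.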

Circularity Check

0 steps flagged

Empirical measurement study with no derivation chain or fitted predictions


The paper reports an empirical study collecting ratings from three psychiatrists on LLM responses, then applies standard external statistics (ICC, Krippendorff’s α) to the new data and supplements with qualitative interviews. No equations, derivations, or predictions are claimed; the central claims about disagreement patterns follow directly from the observed reliability coefficients and interview content without reducing to fitted parameters or self-referential inputs by construction. No self-citations are load-bearing for any result. This is a normal non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the three psychiatrists represent expert judgment and that the rubric captures relevant safety dimensions; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Expert clinical judgment can be elicited via a calibrated rubric and aggregated or compared using standard reliability statistics.
Invoked when interpreting ICC values as evidence of poor consensus.



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

[1]

Clinician-Rated Severity of Nonsuicidal Self-Injury

American Psychiatric Association. Clinician-Rated Severity of Nonsuicidal Self-Injury. https://www.psychiatry.org/File%20Library/Psychiatrists/ Practice/DSM/APA_DSM5_Clinician-Rated-Severity-of-Non-Suicidal-Self-Injury.pdf, 2013. DSM-5 Emerging Measure; accessed 2026-01-13

work page 2013
[2]

DSM-5 Clinician-Rated Dimensions of Psychosis Symptom Severity

American Psychiatric Association. DSM-5 Clinician-Rated Dimensions of Psychosis Symptom Severity. https://www.psychiatry.org/File%20Library/ Psychiatrists/Practice/DSM/APA_DSM5_Clinician-Rated-Dimensions-of-Psychosis-Symptom-Severity.pdf, 2013. Accessed: 2026-01-13.©2013 American Psychiatric Association; reproduced with permission for clinical/research use

work page 2013
[3]

DICES Dataset: Diversity in Conversational AI Evaluation for Safety.Advances in Neural Information Processing Systems, 36:53330–53342, 2023

Lora Aroyo, Alex Taylor, Mark Díaz, Christopher Homan, Alicia Parrish, Gregory Serapio-García, Vinodkumar Prabhakaran, and Ding Wang. DICES Dataset: Diversity in Conversational AI Evaluation for Safety.Advances in Neural Information Processing Systems, 36:53330–53342, 2023

work page 2023
[4]

Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation.AI Magazine, 36(1):15–24, 2015

Lora Aroyo and Chris Welty. Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation.AI Magazine, 36(1):15–24, 2015

work page 2015
[5]

Crowd Truth: Harnessing Disagreement in Crowdsourcing a Relation Extraction Gold Standard

Lora Aroyo and Christopher Welty. Crowd Truth: Harnessing Disagreement in Crowdsourcing a Relation Extraction Gold Standard. InACM Web Science Conference, 2013

work page 2013
[6]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page 2022
[7]

Bech.Clinical Psychometrics

P. Bech.Clinical Psychometrics. John Wiley & Sons, 2nd edition, 2012

work page 2012
[8]

A. T. Beck, A. Weissman, D. Lester, and L. Trexler. The measurement of pessimism: the hopelessness scale.Journal of Consulting and Clinical Psychology, 42(6):861–865, dec 1974

work page 1974
[9]

Consensus report of the apa work group on neuroimaging markers of psychiatric disorders.Am Psychiatr Assoc, 2012

Kelly Botteron, Cameron Carter, Francisco Xavier Castellanos, Daniel P Dickstein, Wayne Drevets, Kerri L Kim, Matthew F Pescosolido, Scott Rausch, Karen E Seymour, Yvette Sheline, et al. Consensus report of the apa work group on neuroimaging markers of psychiatric disorders.Am Psychiatr Assoc, 2012

work page 2012
[10]

Using Thematic Analysis in Psychology.Qualitative Research in Psychology, 3(2):77–101, 2006

Virginia Braun and Victoria Clarke. Using Thematic Analysis in Psychology.Qualitative Research in Psychology, 3(2):77–101, 2006

work page 2006
[11]

Minton, Abigail Lott, and Jinho D

Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, and Jinho D. Choi. CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection, 2025

work page 2025
[12]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.arXiv preprint, 2023

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.arXiv preprint, 2023

work page 2023
[13]

How people use chatgpt

Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Technical report, National Bureau of Economic Research, 2025

work page 2025
[14]

Predicting Depression via Social Media.International AAAI Conference on Web and Social Media, 7(1):128–137, 2013

Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. Predicting Depression via Social Media.International AAAI Conference on Web and Social Media, 7(1):128–137, 2013

work page 2013
[15]

Deep Reinforcement Learning from Human Preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. InAdvances in Neural Information Processing Systems, volume 30, 2017

work page 2017
[16]

Cicchetti

Domenic V. Cicchetti. Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology. Psychological Assessment, 6(4):284–290, 1994

work page 1994
[17]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback.arXiv preprint, 2023

work page 2023
[18]

Diagnostic and statistical manual of mental disorders.Am Psychiatric Assoc, 21(21):591–643, 2013

Fifth Edition et al. Diagnostic and statistical manual of mental disorders.Am Psychiatric Assoc, 21(21):591–643, 2013

work page 2013
[19]

C. G. Fairburn and S. J. Beglin. Eating Disorder Examination Questionnaire (EDE-Q).International Journal of Eating Disorders, 1994. DOI: 10.1037/t03974-000

work page doi:10.1037/t03974-000 1994
[20]

Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial.Journal of Medical Internet Research Mental Health, 4(2):e19, 2017

work page 2017
[21]

Can AI relate: Testing large language model response for mental health support

Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. Can AI relate: Testing large language model response for mental health support. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics, pages 2206–2221, Miami, Florida, USA, 2024. Association for Computational Lingu...

work page 2024
[22]

Impact of preference noise on the alignment performance of generative language models

Yang Gao, Dana Alon, and Donald Metzler. Impact of preference noise on the alignment performance of generative language models. InConference on Language Modeling, 2024

work page 2024
[23]

Blind spots and biases: Exploring the role of annotator cognitive biases in NLP

Sanjana Gautam and Mukund Srinath. Blind spots and biases: Exploring the role of annotator cognitive biases in NLP. InWorkshop on Bridging Human–Computer Interaction and Natural Language Processing, pages 82–88, Mexico City, Mexico, 2024. Association for Computational Linguistics

work page 2024
[24]

Goodman, Lawrence H

Wayne K. Goodman, Lawrence H. Price, Steven A. Rasmussen, Carolyn Mazure, Roberta L. Fleischmann, Candy L. Hill, George R. Heninger, and Dennis S. Charney. The Yale-Brown obsessive compulsive scale: I. Development, use, and reliability.Archives of General Psychiatry, 46(11):1006–1011, 1989

work page 1989
[25]

Gordon, Michelle S

Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. InCHI Conference on Human Factors in Computing Systems, pages 1–19, New York, NY, USA, 2022. Association for Computing Machinery

work page 2022
[26]

Risks from language models for automated mental healthcare: Ethics and structure for implementation

Declan Grabb, Max Lamparth, and Nina Vasan. Risks from language models for automated mental healthcare: Ethics and structure for implementation. InConference on Language Modeling, 2024

work page 2024
[27]

Human Feedback is not Gold Standard.arXiv preprint, 2024

Tom Hosking, Phil Blunsom, and Max Bartolo. Human Feedback is not Gold Standard.arXiv preprint, 2024

work page 2024
[28]

How LLM counselors violate ethical standards in mental health practice: A practitioner-informed framework

Zainab Iftikhar, Amy Xiao, Sean Ransom, Jeff Huang, and Harini Suresh. How LLM counselors violate ethical standards in mental health practice: A practitioner-informed framework. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1311–1323, 2025

work page 2025
[29]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations.arXiv preprint, 2023

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations.arXiv preprint, 2023

work page 2023
[30]

Becky Inkster, Shubhankar Sarda, and Vinod Subramanian. An Empathy-Driven, Conversational Artificial Intelligence Agent (Wysa) for Digital Mental Well-Being: Real-World Data Evaluation Mixed-Methods Study.Journal of Medical Internet Research mHealth and uHealth, 6(11):e12106, 2018

work page 2018
[31]

R. E. Kendell.The Role of Diagnosis in Psychiatry. The Role of Diagnosis in Psychiatry. Blackwell Scientific Publications, Oxford, England, 1975. Pages: viii, 176

work page 1975
[32]

Reliability in Content Analysis: Some Common Misconceptions and Recommendations.Human Communication Research, 30(3):411–433, 2004

Klaus Krippendorff. Reliability in Content Analysis: Some Common Misconceptions and Recommendations.Human Communication Research, 30(3):411–433, 2004

work page 2004
[33]

Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, and Colleen Waickman

Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, and Colleen Waickman. Moving beyond medical exam questions: A clinician-annotated dataset of real-world tasks and ambiguity in mental healthcare.arXiv preprint, 2025

work page 2025
[34]

Matthew Large, Muthusamy Kaneson, Nicholas Myles, Hannah Myles, Pramudie Gunaratne, and Christopher Ryan. Meta-Analysis of Longitudinal Cohort Studies of Suicide Risk Assessment among Psychiatric Patients: Heterogeneity in Results and Lack of Improvement over Time.PloS One, 11(6):e0156322, 2016

work page 2016
[35]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

work page 2023
[36]

Bunyi, Adam C

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, and Ruishan Liu. CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering, 2025

work page 2025
[37]

Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.Journal of Medical Internet Research AI, 3:e52095, 2024

Zoltan P Majdik, S Scott Graham, Jade C Shiva Edward, Sabrina N Rodriguez, Martha S Karnes, Jared T Jensen, Joshua B Barbour, and Justin F Rousseau. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.Journal of Medical Internet Research AI, 3:e52095, 2024

work page 2024
[38]

A diagnostic meta-analysis of the patient health questionnaire-9 (phq-9) algorithm scoring method as a screen for depression.General hospital psychiatry, 37(1):67–75, 2015

Laura Manea, Simon Gilbody, and Dean McMillan. A diagnostic meta-analysis of the patient health questionnaire-9 (phq-9) algorithm scoring method as a screen for depression.General hospital psychiatry, 37(1):67–75, 2015

work page 2015
[39]

McGraw and S

Kenneth O. McGraw and S. P. Wong. Forming Inferences About Some Intraclass Correlation Coefficients.Psychological Methods, 1(1):30–46, 1996

work page 1996
[40]

Ong, and Nick Haber

Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber. Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers. In2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, page 599–627, New York, NY, USA, 2025. Association for Computin...

work page 2025
[41]

Moyers, Lauren N

Theresa B. Moyers, Lauren N. Rowell, Jennifer K. Manuel, Denise Ernst, and Jon M. Houck. The Motivational Interviewing Treatment Integrity Code (MITI 4): Rationale, Preliminary Reliability and Validity.Journal of Substance Abuse Treatment, 65:36–42, 2016

work page 2016
[42]

Department of Veterans Affairs

National Center for PTSD, U.S. Department of Veterans Affairs. Clinician-Administered PTSD Scale for DSM-5 (CAPS-5): Past Week Version. https://www.ptsd.va.gov/professional/assessment/documents/CAPS_5_Past_Week.pdf, 2015. Assessment instrument; accessed 2026-01-13

work page 2015
[43]

NICHQ Vanderbilt Assessment Scales

National Institute for Children’s Health Quality (NICHQ). NICHQ Vanderbilt Assessment Scales. https://nichq.org/wp-content/uploads/2024/09/ NICHQ-Vanderbilt-Assessment-Scales.pdf, 2002. Assessment instrument; accessed 2026-01-13

work page 2024
[44]

Depression

National Institute of Mental Health. Depression. https://www.nimh.nih.gov/health/publications/depression, 2024. NIH Publication No. 24-MH-8079

work page 2024
[45]

Enhancing mental health with artificial intelligence: Current trends and future prospects.Journal of medicine, surgery, and public health, 3:100099, 2024

David B Olawade, Ojima Z Wada, Aderonke Odetayo, Aanuoluwapo Clement David-Olawade, Fiyinfoluwa Asaolu, and Judith Eberhardt. Enhancing mental health with artificial intelligence: Current trends and future prospects.Journal of medicine, surgery, and public health, 3:100099, 2024

work page 2024
[46]

Christiano, Jan Leike, and Ryan Lowe

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

work page 2022
[47]

Inherent Disagreements in Human Textual Inferences.Transactions of the Association for Computational Linguistics, 7:677–694, 2019

Ellie Pavlick and Tom Kwiatkowski. Inherent Disagreements in Human Textual Inferences.Transactions of the Association for Computational Linguistics, 7:677–694, 2019

work page 2019
[48]

Red Teaming Language Models with Language Models, 2022

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red Teaming Language Models with Language Models, 2022

work page 2022
[49]

Posner, D

K. Posner, D. Brent, C. Lucas, M. Gould, B. Stanley, G. Brown, P. Fisher, J. Zelazny, A. Burke, M. Oquendo, and J. Mann. Columbia-Suicide Severity Rating Scale (C-SSRS): Pediatric – Since Last Contact – Communities and Healthcare. https://cssrs.columbia.edu/wp-content/uploads/C- SSRS_Pediatric-SLC_11.14.16.pdf, 2010. Version 6/23/10; accessed 2026-01-13

work page 2010
[50]

Prochaska, Erin A

Judith J. Prochaska, Erin A. Vogel, Amy Chieng, Matthew Kendra, Michael Baiocchi, Sarah Pajarito, and Athena Robinson. A Therapeutic Relational Agent for Reducing Problematic Substance Use (Woebot): Development and Usability Study.Journal of Medical Internet Research, 23(3):e24850, 2021

work page 2021
[51]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2024

work page 2024
[52]

Regier, William E

Darrel A. Regier, William E. Narrow, Diana E. Clarke, Helena C. Kraemer, S. Janet Kuramoto, Emily A. Kuhl, and David J. Kupfer. DSM-5 field trials in the United States and Canada, Part II: test-retest reliability of selected categorical diagnoses.The American Journal of Psychiatry, 170(1):59–70, 2013

work page 2013
[53]

Large language models as mental health resources: Patterns of use in the united states, 2025

Tony Rousmaniere, Xu Li, Yimeng Zhang, and Siddharth Shah. Large language models as mental health resources: Patterns of use in the united states, 2025

work page 2025
[54]

Ruvini Sanjeewa, Ravi Iyer, Pragalathan Apputhurai, Nilmini Wickramasinghe, and Denny Meyer. Perception of Empathy in Mental Health Care Through Voice-Based Conversational Agent Prototypes: Experimental Study.Journal of Medical Internet Research Formative Research, 9:e69329, 2025

[55]

Ashish Sharma, Inna W. Lin, Adam S. Miner, David C. Atkins, and Tim Althoff. Human–AI Collaboration Enables More Empathic Conversations in Text-Based Peer-to-Peer Mental Health Support. Nature Machine Intelligence, 5(1):46–57, 2023

[56]

Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5263–5276. Association for Computational Linguistics, 2020

[57]

P. E. Shrout and J. L. Fleiss. Intraclass Correlations: Uses in Assessing Rater Reliability. Psychological Bulletin, 86(2):420–428, 1979

[58]

Vipul Singh, Sharmila Sarkar, Vikas Gaur, Sandeep Grover, and Om Prakash Singh. Clinical Practice Guidelines on using artificial intelligence and gadgets for mental health and well-being. Indian Journal of Psychiatry, 66(Suppl 2):S414–S419, 2024

[59]

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas,...

[60]

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. arXiv preprint, 2022

[61]

Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, and Florian Hinterwimmer. A Practical Guide to Fine-Tuning Language Models with Limited Data. arXiv preprint, 2024

[62]

Joseph Ventura, D. Lukoff, Keith Nuechterlein, R. P. Liberman, Megan Green, and Andrew Shaner. Brief Psychiatric Rating Scale Expanded version 4.0: Scales anchor points and administration manual. International Journal of Methods in Psychiatric Research, 13:221–244, 1993

[63]

Philip S. Wang, Patricia Berglund, Mark Olfson, Harold A. Pincus, Kenneth B. Wells, and Ronald C. Kessler. Failure and Delay in Initial Treatment Contact After First Onset of Mental Disorders in the National Comorbidity Survey Replication. Archives of General Psychiatry, 62(6):603–613, 2005

[64]

R. C. Young, J. T. Biggs, V. E. Ziegler, and D. A. Meyer. A rating scale for mania: reliability, validity and sensitivity. The British Journal of Psychiatry, 133(5):429–435, 1978

[65]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023

[66]

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is More for Alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023.

A Appendix

A.1 Prompt Design Matrix

Table 5. Clinical Conditions and Assessment Scales.

Condition Risk Type A...

