
arxiv: 2601.18061 · v3 · submitted 2026-01-26 · 💻 cs.AI · cs.HC

Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

Pith reviewed 2026-05-16 11:43 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords mental health AI · expert evaluation · human feedback · inter-rater reliability · AI safety · LLM responses · clinical disagreement · suicide risk assessment

The pith

Aggregated expert judgments in mental health AI safety testing erase distinct clinical philosophies and yield unreliable ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the core assumption in learning from human feedback that averaging expert ratings produces valid training and evaluation data for AI. Three psychiatrists applied a shared rubric to LLM-generated mental health responses and showed consistently low inter-rater reliability, with the worst scores on suicide and self-harm items. Qualitative interviews showed that the disagreement arose from coherent but incompatible individual clinical orientations rather than random error or poor training. Because these orientations reflect real professional judgment, simple averaging creates compromise labels that discard the underlying reasoning. In safety-critical domains this undermines the use of consensus-based human feedback for reward models and benchmarks.

Core claim

Aggregated expert labels function as arithmetic compromises that effectively erase grounded professional philosophies. Expert disagreement in safety-critical AI is a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence rather than measurement error.
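The "arithmetic compromise" point can be illustrated with a minimal sketch. The scores below are made up, and the 1-5 safety scale is an assumption for illustration; neither comes from the paper. Two responses receive the same averaged label, but only one of them reflects actual agreement.

```python
from statistics import mean, pstdev

# Hypothetical scores from three raters on an assumed 1-5 safety scale.
# These numbers are illustrative only; they are not data from the paper.
ratings = {
    "response_a": [3, 3, 3],  # all three raters agree
    "response_b": [1, 3, 5],  # raters diverge sharply
}


def summarize(scores):
    """Return the aggregated label plus two simple measures of spread."""
    return {
        "mean": mean(scores),
        "spread": max(scores) - min(scores),
        "stdev": pstdev(scores),
    }


summaries = {name: summarize(scores) for name, scores in ratings.items()}

for name, summary in summaries.items():
    print(name, summary)
```

Both responses end up with the same mean, so a pipeline that stores only the mean cannot tell them apart; the spread columns are what carry the disagreement.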

What carries the argument

Inter-rater reliability statistics (ICC and Krippendorff's alpha) paired with qualitative analysis of psychiatrist interviews that map responses to distinct clinical frameworks such as safety-first, engagement-centered, and culturally-informed orientations.
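As a rough illustration of the first statistic, the sketch below computes a single-rater, two-way random-effects ICC, often written ICC(2,1) in the Shrout and Fleiss notation. Which ICC form the paper used is not stated on this page, so that choice is an assumption, as are the function name `icc2_1` and the example ratings.

```python
import numpy as np


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an items x raters table with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape  # n items, k raters
    grand = x.mean()

    # Sums of squares for items (rows), raters (columns) and residual error.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-item mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical ratings: 6 responses scored by 3 raters on an assumed 1-5 scale.
example = [
    [2, 4, 3],
    [1, 4, 2],
    [3, 3, 5],
    [2, 5, 1],
    [4, 4, 3],
    [1, 3, 4],
]
print(round(icc2_1(example), 3))
```

A value near 1 means most variance comes from real differences between responses; values near 0 mean rater and residual variance dominate.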

If this is right

  • Reward modeling for mental health AI must treat expert disagreement as structured signal rather than noise to be averaged away.
  • Safety classification benchmarks should move away from single consensus labels toward representations that keep separate expert frameworks visible.
  • Evaluation protocols need methods that learn from multiple professional heuristics instead of forcing arithmetic agreement.
  • Training pipelines that preserve and model individual expert philosophies could produce AI systems better aligned with real clinical practice.
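One way to read the second and third implications is as a change in how labels are stored. The sketch below is a hypothetical data structure, not something the paper specifies: the class name `DisaggregatedLabel`, the rater identifiers, and the scores are all assumptions. It keeps every rater's score and exposes both the usual mean and a per-score distribution that could serve as a soft label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DisaggregatedLabel:
    """Keeps every rater's score instead of collapsing to one consensus value."""

    item_id: str
    scores: dict  # rater id -> score

    def consensus(self):
        """The usual aggregated label: the arithmetic mean of all scores."""
        return sum(self.scores.values()) / len(self.scores)

    def distribution(self):
        """Share of raters choosing each score, usable as a soft label."""
        counts = Counter(self.scores.values())
        total = len(self.scores)
        return {score: count / total for score, count in sorted(counts.items())}


# Hypothetical item; rater ids are named after the three orientations only
# for readability, and the scores are invented.
label = DisaggregatedLabel(
    item_id="item-001",
    scores={"safety_first": 1, "engagement_centered": 4, "culturally_informed": 4},
)
print(label.consensus())
print(label.distribution())
```

Keeping the per-rater mapping means the consensus value can always be recomputed later, whereas the reverse is not possible once only the mean is stored.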

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structured divergence may appear in other high-stakes expert domains such as legal review or medical diagnosis AI.
  • Alignment techniques could be developed to train separate models on each coherent expert framework rather than a single averaged dataset.
  • Practitioners might test whether AI performance on safety tasks improves when models are exposed to disaggregated expert labels during training.

Load-bearing premise

The disagreement patterns seen with these three psychiatrists and this rubric would appear similarly with other experts and different evaluation instruments.

What would settle it

A replication study with a larger panel of psychiatrists evaluating comparable LLM responses that reports high inter-rater reliability (ICC above 0.7) across safety-critical items would falsify the central claim.

Original abstract

Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and that divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's α = -0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally-informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
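The negative α reported in the abstract is easier to interpret with the definition in hand: α = 1 - D_o / D_e, where D_o is the observed disagreement within items and D_e is the disagreement expected if ratings were paired at random. The sketch below is a minimal version for complete data with an interval-level distance; the measurement level the paper used is not stated here, so that choice, the function name, and the toy ratings are assumptions.

```python
import numpy as np


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for an items x raters table with no missing values.

    Uses the squared difference as the distance between two ratings
    (interval level), which is an assumption of this sketch.
    """
    x = np.asarray(ratings, dtype=float)
    n_items, m = x.shape
    n = x.size  # total number of ratings

    # Observed disagreement: squared differences over ordered rater pairs
    # within each item (the zero diagonal contributes nothing).
    within = ((x[:, :, None] - x[:, None, :]) ** 2).sum()
    d_o = within / (n * (m - 1))

    # Expected disagreement: squared differences over all ordered pairs of
    # ratings, regardless of which item they belong to.
    flat = x.ravel()
    d_e = ((flat[:, None] - flat[None, :]) ** 2).sum() / (n * (n - 1))

    return 1.0 - d_o / d_e


# Hypothetical ratings where two raters move in opposite directions.
opposed = [
    [1, 5],
    [5, 1],
    [2, 4],
    [4, 2],
]
print(krippendorff_alpha_interval(opposed))
```

When raters disagree within items more than randomly paired ratings would, D_o exceeds D_e and α drops below zero, which is the pattern the abstract describes as structured disagreement worse than chance.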

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript examines the limits of aggregated human feedback in mental health AI safety testing by having three psychiatrists evaluate LLM-generated responses using a calibrated rubric. It reports poor inter-rater reliability (ICC ranging from 0.087 to 0.295, with one factor showing Krippendorff's α = -0.203) and uses qualitative interviews to argue that disagreements stem from coherent but incompatible clinical frameworks (safety-first, engagement-centered, culturally-informed), rather than random error. The central claim is that aggregated labels erase grounded professional philosophies, and expert disagreement in safety-critical AI should be treated as a sociotechnical phenomenon requiring alignment methods that preserve divergence.

Significance. If the results are robust, this work is significant for challenging core assumptions in RLHF and LHF for high-stakes domains. It provides empirical evidence that expert consensus may not be appropriate for mental health safety evaluation, with implications for reward modeling, safety classification, and benchmarks. The combination of quantitative reliability metrics and qualitative insights into frameworks is a strength, highlighting the need for methods that learn from disagreement.

major comments (3)
  1. [Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.
  2. [Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.
  3. [Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.
minor comments (2)
  1. [Abstract] The range for ICC is given as 0.087–0.295, but it is unclear which specific factors or items correspond to the lower and upper bounds; specifying this would improve clarity.
  2. [Introduction] The paper could benefit from citing more prior work on inter-rater reliability in clinical psychology or AI safety evaluations to contextualize the findings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and robustness of our manuscript. We address each of the major comments below.

  1. Referee: [Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.

    Authors: We acknowledge the limitation of having only three raters, which does introduce potential for high sampling variability in the reliability metrics. However, the fact that we observed consistently low ICC values across different factors and a negative alpha on a key safety item points to systematic disagreement. To strengthen this, we have added bootstrap resampling to estimate confidence intervals for the ICC and alpha values in the revised Methods and Results sections. We also examined the data for outliers and confirmed that the negative alpha persists even after sensitivity checks.

    Revision: partial

  2. Referee: [Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.

    Authors: We have revised the manuscript to include the total number of LLM responses evaluated and the number of rubric items per category in the Methods section. This allows for assessment of statistical power and generalizability of the disagreement patterns.

    Revision: yes

  3. Referee: [Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.

    Authors: The qualitative component was intended to provide interpretive depth to the quantitative reliability findings. We have expanded the Discussion to include additional quotes from the interviews that directly map each psychiatrist's framework to their specific rating behaviors on the rubric items. While the small number of raters precludes formal quantitative correlation analyses, the alignment between self-described frameworks and observed rating patterns is evident in the data. We have added a supplementary table illustrating this mapping.

    Revision: partial
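The bootstrap resampling mentioned in the first response could look roughly like the sketch below. It is a percentile bootstrap over items, reusing an ICC(2,1) estimator. The simulated authors' actual procedure is not described, so the resampling unit, the ICC form, the function names, and the example ratings are all assumptions.

```python
import numpy as np


def icc2_1(ratings):
    """ICC(2,1) for an items x raters table with no missing values."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bootstrap_icc_ci(ratings, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for ICC(2,1), resampling items (rows)."""
    x = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    estimates = []
    for _ in range(n_boot):
        sample = x[rng.integers(0, n, size=n)]  # draw items with replacement
        with np.errstate(divide="ignore", invalid="ignore"):
            estimate = icc2_1(sample)
        if np.isfinite(estimate):  # skip degenerate resamples
            estimates.append(estimate)
    tail = 100 * (1 - level) / 2
    low, high = np.percentile(estimates, [tail, 100 - tail])
    return float(low), float(high)


# Hypothetical ratings: 8 responses scored by 3 raters on an assumed 1-5 scale.
example = [
    [2, 4, 3], [1, 4, 2], [3, 3, 5], [2, 5, 1],
    [4, 4, 3], [1, 3, 4], [5, 2, 3], [3, 5, 2],
]
interval = bootstrap_icc_ci(example)
print(interval)
```

Resampling items captures uncertainty from the choice of responses but not from the choice of raters, so with only three psychiatrists an interval of this kind would not by itself address the referee's concern about rater sampling variability.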

Circularity Check

0 steps flagged

Empirical measurement study with no derivation chain or fitted predictions


The paper reports an empirical study collecting ratings from three psychiatrists on LLM responses, then applies standard external statistics (ICC, Krippendorff’s α) to the new data and supplements with qualitative interviews. No equations, derivations, or predictions are claimed; the central claims about disagreement patterns follow directly from the observed reliability coefficients and interview content without reducing to fitted parameters or self-referential inputs by construction. No self-citations are load-bearing for any result. This is a normal non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the three psychiatrists represent expert judgment and that the rubric captures relevant safety dimensions; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Expert clinical judgment can be elicited via a calibrated rubric and aggregated or compared using standard reliability statistics.
    Invoked when interpreting ICC values as evidence of poor consensus.

pith-pipeline@v0.9.0 · 5582 in / 1236 out tokens · 19938 ms · 2026-05-16T11:43:20.365295+00:00 · methodology

