Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
Pith reviewed 2026-05-16 11:43 UTC · model grok-4.3
The pith
Aggregated expert judgments in mental health AI safety testing erase distinct clinical philosophies and yield unreliable ground truth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aggregated expert labels function as arithmetic compromises that effectively erase grounded professional philosophies. Expert disagreement in safety-critical AI is a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence rather than measurement error.
What carries the argument
Inter-rater reliability statistics (ICC and Krippendorff's alpha) paired with qualitative analysis of psychiatrist interviews that map responses to distinct clinical frameworks such as safety-first, engagement-centered, and culturally-informed orientations.
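As a rough illustration of the first of these statistics, the sketch below computes one common form of the intraclass correlation, ICC(2,1) (two-way random effects, absolute agreement, single rater). The summary does not say which ICC form the paper uses, so the choice of ICC(2,1), the function name `icc2_1`, and the toy ratings are assumptions made for illustration only.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item, one column per rater.
    Assumes complete data (every rater scored every item).
    """
    n = len(ratings)      # number of rated items
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: items, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy ratings (not the paper's data): 6 items scored by 4 raters.
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(toy_ratings), 2))  # 0.29
```

A value near 0.29 sits at the top of the range the paper reports, which gives a feel for how much rater-level offset and item-level noise such a coefficient already implies.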
If this is right
- Reward modeling for mental health AI must treat expert disagreement as structured signal rather than noise to be averaged away.
- Safety classification benchmarks should move away from single consensus labels toward representations that keep separate expert frameworks visible.
- Evaluation protocols need methods that learn from multiple professional heuristics instead of forcing arithmetic agreement.
- Training pipelines that preserve and model individual expert philosophies could produce AI systems better aligned with real clinical practice.
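The contrast between a single consensus label and a representation that keeps each framework visible can be sketched in a few lines. Everything here is hypothetical: the `ItemLabels` class, the 1-to-5 score scale, the item identifiers, and the pairing of one score with each framework name are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ItemLabels:
    """Per-item labels kept separately for each (hypothetical) expert framework."""
    item_id: str
    scores: dict  # framework name -> rubric score

    def consensus(self):
        # The averaged label that consensus-based aggregation would keep.
        return mean(list(self.scores.values()))

    def spread(self):
        # Population standard deviation across raters; lost once labels are averaged.
        return pstdev(list(self.scores.values()))


contested = ItemLabels(
    "self-harm-example",
    {"safety_first": 1, "engagement_centered": 5, "culturally_informed": 3},
)
agreed = ItemLabels(
    "sleep-example",
    {"safety_first": 3, "engagement_centered": 3, "culturally_informed": 3},
)

for item in (contested, agreed):
    print(item.item_id, item.consensus(), round(item.spread(), 2))
```

Both items receive the same averaged score, yet only one of them reflects agreement. A pipeline that stores `scores` rather than `consensus()` keeps that distinction available for training and evaluation.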
Where Pith is reading between the lines
- The same structured divergence may appear in other high-stakes expert domains such as legal review or medical diagnosis AI.
- Alignment techniques could be developed to train separate models on each coherent expert framework rather than a single averaged dataset.
- Practitioners might test whether AI performance on safety tasks improves when models are exposed to disaggregated expert labels during training.
Load-bearing premise
The disagreement patterns seen with these three psychiatrists and this rubric would appear similarly with other experts and different evaluation instruments.
What would settle it
A replication study with a larger panel of psychiatrists evaluating comparable LLM responses that reports high inter-rater reliability (ICC above 0.7) across safety-critical items would falsify the central claim.
Original abstract
Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's α = -0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally-informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
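To make the negative reliability figure concrete, the sketch below computes Krippendorff's alpha as 1 - D_o / D_e with a squared-difference (interval) metric on complete data. The abstract does not state which metric level the authors used, so the interval metric, the function name, and the toy ratings are assumptions. Alpha drops below zero whenever raters differ more within an item than ratings differ across the pooled data set.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha, interval metric, no missing ratings.

    units: list of per-item rating lists, each with at least two ratings.
    """
    values = [v for unit in units for v in unit]
    total = len(values)

    # Observed disagreement: squared differences between raters within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in units
    ) / total

    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (total * (total - 1))
    return 1.0 - d_o / d_e


# Toy case: every item receives the scores 1, 3 and 5, but from different raters,
# so raters disagree on each item while the item averages are identical.
structured = [[1, 3, 5], [5, 1, 3], [3, 5, 1]]
print(round(krippendorff_alpha_interval(structured), 3))  # -0.333
```

In this toy case all of the variation sits between raters and none between items, which is the pattern a negative alpha signals. With few items or raters, a negative estimate can also arise from sampling variability alone.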
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the limits of aggregated human feedback in mental health AI safety testing by having three psychiatrists evaluate LLM-generated responses using a calibrated rubric. It reports poor inter-rater reliability (ICC ranging from 0.087 to 0.295, with one factor showing Krippendorff's α = -0.203) and uses qualitative interviews to argue that disagreements stem from coherent but incompatible clinical frameworks (safety-first, engagement-centered, culturally-informed), rather than random error. The central claim is that aggregated labels erase grounded professional philosophies, and expert disagreement in safety-critical AI should be treated as a sociotechnical phenomenon requiring alignment methods that preserve divergence.
Significance. If the results are robust, this work is significant for challenging core assumptions in RLHF and LHF for high-stakes domains. It provides empirical evidence that expert consensus may not be appropriate for mental health safety evaluation, with implications for reward modeling, safety classification, and benchmarks. The combination of quantitative reliability metrics and qualitative insights into frameworks is a strength, highlighting the need for methods that learn from disagreement.
major comments (3)
- [Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.
- [Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.
- [Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.
minor comments (2)
- [Abstract] The range for ICC is given as 0.087–0.295, but it is unclear which specific factors or items correspond to the lower and upper bounds; specifying this would improve clarity.
- [Introduction] The paper could benefit from citing more prior work on inter-rater reliability in clinical psychology or AI safety evaluations to contextualize the findings.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us improve the clarity and robustness of our manuscript. We address each of the major comments below.
- Referee: [Methods] The study relies on only three certified psychiatrists as raters. With such a small sample, the ICC and Krippendorff's alpha estimates are subject to high sampling variability; the negative alpha could arise from a single outlier or specific rubric features rather than indicating structured disagreement worse than chance.
  Authors: We acknowledge the limitation of having only three raters, which does introduce potential for high sampling variability in the reliability metrics. However, the fact that we observed consistently low ICC values across different factors and a negative alpha on a key safety item points to systematic disagreement. To strengthen this, we have added bootstrap resampling to estimate confidence intervals for the ICC and alpha values in the revised Methods and Results sections. We also examined the data for outliers and confirmed that the negative alpha persists even after sensitivity checks.
  Revision: partial
- Referee: [Results] No information is provided on the total number of LLM responses evaluated or the number of rubric items per category. This omission prevents assessment of whether the reported disagreement patterns (e.g., highest on suicide/self-harm) have sufficient statistical power or are generalizable beyond the specific sample.
  Authors: We have revised the manuscript to include the total number of LLM responses evaluated and the number of rubric items per category in the Methods section. This allows for assessment of statistical power and generalizability of the disagreement patterns.
  Revision: yes
- Referee: [Discussion] The interpretation that disagreement reflects 'coherent but incompatible individual clinical frameworks' is based on qualitative interviews, but the manuscript does not provide quantitative evidence, such as correlation between rater frameworks and rating patterns or inter-rater agreement within framework groups, to support that the divergence is principled rather than due to other sources of variance.
  Authors: The qualitative component was intended to provide interpretive depth to the quantitative reliability findings. We have expanded the Discussion to include additional quotes from the interviews that directly map each psychiatrist's framework to their specific rating behaviors on the rubric items. While the small number of raters precludes formal quantitative correlation analyses, the alignment between self-described frameworks and observed rating patterns is evident in the data. We have added a supplementary table illustrating this mapping.
  Revision: partial
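The bootstrap resampling mentioned in the first response can be sketched as a percentile bootstrap over rated items. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names, the number of resamples, the toy ratings, and the stand-in statistic `mean_rater_range` are all hypothetical. In practice the statistic passed in would be an ICC or alpha function.

```python
import random


def bootstrap_ci(items, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling rated items with replacement.

    items: list of per-item rating lists (one score per rater).
    statistic: function mapping such a list to a single number.
    """
    rng = random.Random(seed)
    n = len(items)
    estimates = []
    for _ in range(n_boot):
        sample = [items[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_boot - 1))]
    upper = estimates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper


def mean_rater_range(items):
    """Stand-in disagreement statistic: average (max - min) score per item."""
    return sum(max(row) - min(row) for row in items) / len(items)


# Toy ratings (not the paper's data): 6 items scored by 3 raters.
toy_ratings = [[1, 3, 5], [2, 2, 3], [4, 4, 4], [1, 5, 2], [3, 3, 4], [5, 1, 1]]
lower, upper = bootstrap_ci(toy_ratings, mean_rater_range, n_boot=500, seed=1)
print(lower, mean_rater_range(toy_ratings), upper)
```

Resampling items addresses uncertainty from the choice of responses, not from the choice of raters. With only three raters, the interval says little about how a different panel would behave, which is the referee's underlying concern. Ratio-type statistics such as ICC or alpha can also be undefined on degenerate resamples where all ratings coincide, so a real analysis would need to handle that case.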
Circularity Check
Empirical measurement study with no derivation chain or fitted predictions
The paper reports an empirical study collecting ratings from three psychiatrists on LLM responses, then applies standard external statistics (ICC, Krippendorff’s α) to the new data and supplements with qualitative interviews. No equations, derivations, or predictions are claimed; the central claims about disagreement patterns follow directly from the observed reliability coefficients and interview content without reducing to fitted parameters or self-referential inputs by construction. No self-citations are load-bearing for any result. This is a normal non-circular empirical analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert clinical judgment can be elicited via a calibrated rubric and aggregated or compared using standard reliability statistics.