AI Content Moderation in Therapy Conversations
Pith reviewed 2026-06-29 21:00 UTC · model grok-4.3
The pith
Content moderation systems flag real therapy sessions as undesirable, which may limit the use of LLMs for therapy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An algorithm audit of OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma applied to transcripts of real-life therapy sessions shows these systems frequently classify the content as undesirable, raising direct implications for the constraints users and organizations face when attempting to use LLMs in therapist roles.
What carries the argument
Algorithm audit that applies three state-of-the-art content moderation systems to real therapy session transcripts and measures the rate at which they flag the material.
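The measurement described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: each moderation system is modeled as a hypothetical callable that takes text and returns True when it flags it, the keyword-based stand-ins below are toy classifiers rather than the real endpoints, and the sample utterances are invented for the example.

```python
from typing import Callable, Dict, List

# Assumption: a moderation system is any callable mapping text -> bool (True = flagged).
Moderator = Callable[[str], bool]


def flag_rates(utterances: List[str], moderators: Dict[str, Moderator]) -> Dict[str, float]:
    """Return, for each system, the fraction of utterances it flags."""
    if not utterances:
        raise ValueError("need at least one utterance")
    rates = {}
    for name, is_flagged in moderators.items():
        flagged = sum(1 for text in utterances if is_flagged(text))
        rates[name] = flagged / len(utterances)
    return rates


def keyword_moderator(keywords: List[str]) -> Moderator:
    """Toy stand-in classifier: flags text containing any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]

    def is_flagged(text: str) -> bool:
        t = text.lower()
        return any(k in t for k in lowered)

    return is_flagged


# Invented utterances for illustration only; not drawn from the paper's transcripts.
sample_utterances = [
    "I have been sleeping badly since the move.",
    "Sometimes I think about hurting myself.",
    "My father used to hit me when he drank.",
    "Work has been fine this week.",
]

# Hypothetical system names; real audits would wrap the actual endpoints here.
stand_ins = {
    "system_a": keyword_moderator(["hurting myself"]),
    "system_b": keyword_moderator(["hurting myself", "hit me"]),
    "system_c": keyword_moderator([]),
}

rates = flag_rates(sample_utterances, stand_ins)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of utterances flagged")
```

Keeping the moderators behind one callable interface means the same loop could wrap any of the three audited systems, so differences in flag rate reflect the classifiers rather than the harness.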
If this is right
- LLMs using these systems may routinely refuse engagement on topics that arise in therapy.
- Organizations building therapeutic LLMs must navigate a safety-liability tradeoff that restricts core clinical content.
- Users seeking AI emotional support may encounter consistent blocks on the sensitive subjects that matter most.
Where Pith is reading between the lines
- Therapeutic applications may require context-specific moderation rules that differ from general-purpose ones.
- Measuring downstream effects on actual conversation quality would clarify whether flagging translates to unusable outputs.
- Policy discussions on AI in mental health could incorporate moderation thresholds as a distinct design variable.
Load-bearing premise
That flags placed on therapy transcripts by moderation systems directly prevent LLMs from functioning effectively as therapists, without separate evidence on how those flags change actual model behavior or user results.
What would settle it
A controlled run in which an LLM equipped with one of the audited guardrails is given full therapy transcripts and the frequency of refusals or topic avoidance on sensitive material is recorded.
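One way such a run could be scored is sketched below. Everything here is an assumption made for illustration: the guarded model is a stub, the refusal detector is a crude phrase match with hypothetical marker strings, and the client turns are invented. A real study would need a validated refusal and topic-avoidance coding scheme.

```python
from typing import Callable, List

# Hypothetical refusal phrases; a real study would use a validated coding scheme.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot help with",
    "i'm not able to discuss",
)


def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic: does the reply contain a known refusal phrase?"""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(client_turns: List[str], respond: Callable[[List[str]], str]) -> float:
    """Replay client turns in order, passing the growing history to the model,
    and return the fraction of replies that look like refusals."""
    if not client_turns:
        raise ValueError("need at least one client turn")
    history: List[str] = []
    refusals = 0
    for turn in client_turns:
        history.append(turn)
        reply = respond(list(history))  # copy so the model cannot mutate our history
        history.append(reply)
        if looks_like_refusal(reply):
            refusals += 1
    return refusals / len(client_turns)


def stub_guarded_model(history: List[str]) -> str:
    """Stand-in for an LLM behind a guardrail; refuses on one sensitive phrase."""
    last_client_turn = history[-1].lower()
    if "hurting myself" in last_client_turn:
        return "I can't help with that topic."
    return "Tell me more about that."


# Invented turns for illustration only.
turns = [
    "I have been feeling low for weeks.",
    "Sometimes I think about hurting myself.",
    "I have not told anyone else.",
    "Talking about it helps a little.",
]

rate = refusal_rate(turns, stub_guarded_model)
print(f"refusal rate: {rate:.0%}")
```

Passing the full history to the model on every turn matters because a guardrail may react to earlier context as well as to the latest message.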
Original abstract
Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChatGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users for both liability and safety purposes, and this inability to broach these subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results raise implications for the limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper performs an algorithm audit of three content moderation systems (OpenAI moderation endpoint, Llama Guard, Shield Gemma) applied to transcripts of real therapy sessions. It claims that these systems flag therapy content as undesirable at rates that may limit LLMs' capacity to serve as therapists by preventing discussion of sensitive topics.
Significance. If the audit were shown to use a representative sample and if flagging rates were linked to actual LLM refusal behavior or therapeutic outcomes, the result would usefully document a concrete tension between safety guardrails and therapeutic utility in HCI and AI ethics. The current manuscript supplies no such linkage or quantitative details.
major comments (2)
- [Abstract] The audit is described as having been performed, yet no sample size, selection criteria for therapy transcripts, quantitative flagging rates per system, or error analysis is supplied. This omission makes the data-to-claim link impossible to evaluate and is load-bearing for the central empirical claim.
- [Results / Discussion] The manuscript measures only static classification of existing transcripts by moderation endpoints. It supplies no measurements of how such flags are applied during live multi-turn LLM generation, whether the underlying models refuse or hedge, or any effect on therapeutic alliance, disclosure, or outcome metrics. Without these links the observed flagging cannot be translated into a capacity limitation for therapy.
minor comments (1)
- Add a table or figure summarizing flagging rates across the three systems and any available session metadata to improve clarity of the audit results.
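A summary of the kind requested could be rendered as a plain-text table. The sketch below only shows the layout: the helper name is hypothetical, and the system names and counts are placeholders, not results from the paper.

```python
from typing import Dict, Tuple


def format_flag_table(counts: Dict[str, Tuple[int, int]]) -> str:
    """Render per-system (flagged, total) counts as an aligned plain-text table."""
    header = f"{'System':<16}{'Flagged':>8}{'Total':>8}{'Rate':>8}"
    lines = [header, "-" * len(header)]
    for name, (flagged, total) in counts.items():
        # Guard against empty samples instead of dividing by zero.
        rate = flagged / total if total else float("nan")
        lines.append(f"{name:<16}{flagged:>8}{total:>8}{rate:>8.1%}")
    return "\n".join(lines)


# Placeholder counts for layout purposes only; not results from the audit.
placeholder_counts = {
    "system_a": (3, 10),
    "system_b": (0, 10),
}

table = format_flag_table(placeholder_counts)
print(table)
```

Reporting the raw counts next to each rate lets a reader judge how much weight a given percentage can bear.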
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our algorithm audit of content moderation systems applied to therapy transcripts. We address each major comment below, clarifying the manuscript's scope while making targeted revisions where appropriate.
- Referee: [Abstract] The audit is described as having been performed, yet no sample size, selection criteria for therapy transcripts, quantitative flagging rates per system, or error analysis is supplied. This omission makes the data-to-claim link impossible to evaluate and is load-bearing for the central empirical claim.
  Authors: We agree that the abstract would be strengthened by including these quantitative details from the results section. The manuscript reports a sample of real therapy transcripts selected according to explicit criteria (publicly available sessions focused on emotional support topics), along with per-system flagging rates and an error analysis of false positives on sensitive but clinically relevant content. We will revise the abstract to incorporate sample size, selection criteria, key flagging rates, and a summary of the error analysis.
  Revision: yes
- Referee: [Results / Discussion] The manuscript measures only static classification of existing transcripts by moderation endpoints. It supplies no measurements of how such flags are applied during live multi-turn LLM generation, whether the underlying models refuse or hedge, or any effect on therapeutic alliance, disclosure, or outcome metrics. Without these links the observed flagging cannot be translated into a capacity limitation for therapy.
  Authors: The study is designed as a static algorithm audit of three moderation endpoints on authentic therapy transcripts, which directly documents how these systems classify real therapeutic content. This provides evidence of potential constraints without requiring live deployment. We do not measure or claim direct effects on multi-turn generation, refusal behavior, or clinical outcomes, as those would necessitate separate experiments involving actual LLM-therapist interactions. The discussion frames the findings as raising implications for LLM design rather than establishing causal links to therapeutic alliance or outcomes.
  Revision: no
Circularity Check
Empirical audit with no derivations or self-referential predictions
The paper performs a direct empirical audit by applying three external moderation endpoints (OpenAI, Llama Guard, Shield Gemma) to real therapy transcripts and reporting flagging rates. No equations, fitted parameters, predictions, or self-citations are used to derive results; the central claim rests on observed outputs from independent systems against external data. This structure contains no load-bearing steps that reduce to inputs by construction.