pith. sign in

arxiv: 2605.25454 · v1 · pith:FBM3RQTNnew · submitted 2026-05-25 · 💻 cs.HC · cs.AI· cs.CL· cs.CY· cs.SI

AI Content Moderation in Therapy Conversations

Pith reviewed 2026-06-29 21:00 UTC · model grok-4.3

classification 💻 cs.HC cs.AIcs.CLcs.CYcs.SI
keywords content moderationlarge language modelstherapyalgorithm auditemotional supportAI guardrailssensitive topics
0
0 comments X

The pith

Content moderation systems flag real therapy sessions as undesirable, limiting LLMs for therapeutic use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits three moderation systems to determine how often they flag content drawn from actual therapy sessions. It establishes that these guardrails, built to block sensitive topics for safety reasons, routinely identify therapy material as problematic. A sympathetic reader would care because LLMs are already deployed for emotional support and under development for formal therapy, yet the same mechanisms that prevent harm may block the very discussions therapy requires. The audit therefore points to concrete design barriers for organizations building or deploying such models.

Core claim

An algorithm audit of OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma applied to transcripts of real-life therapy sessions shows these systems frequently classify the content as undesirable, raising direct implications for the constraints users and organizations face when attempting to use LLMs in therapist roles.

What carries the argument

Algorithm audit that applies three state-of-the-art content moderation systems to real therapy session transcripts and measures the rate at which they flag the material.

If this is right

  • LLMs using these systems may routinely refuse engagement on topics that arise in therapy.
  • Organizations building therapeutic LLMs must navigate a safety-liability tradeoff that restricts core clinical content.
  • Users seeking AI emotional support may encounter consistent blocks on the sensitive subjects that matter most.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Therapeutic applications may require context-specific moderation rules that differ from general-purpose ones.
  • Measuring downstream effects on actual conversation quality would clarify whether flagging translates to unusable outputs.
  • Policy discussions on AI in mental health could incorporate moderation thresholds as a distinct design variable.

Load-bearing premise

That flags placed on therapy transcripts by moderation systems directly prevent LLMs from functioning effectively as therapists, without separate evidence on how those flags change actual model behavior or user results.

What would settle it

A controlled run in which an LLM equipped with one of the audited guardrails is given full therapy transcripts and the frequency of refusals or topic avoidance on sensitive material is recorded.

Figures

Figures reproduced from arXiv: 2605.25454 by Claire Wang, Jiwon Kim, Koustuv Saha, Sabelle Huang, Taeung Yoon.

Figure 1
Figure 1. Figure 1: Example of a flagged conversation between a Ther [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes. However, LLMs like ChaptGPT or Llama are often developed with content moderation guardrails that prevent them from discussing sensitive subjects with users for both liability and safety purposes, and this inability to broach these subjects may affect their capacity as therapists. In this study, we perform an algorithm audit on three state-of-the-art moderation systems (OpenAI's moderation endpoint, Meta's Llama Guard, and Google's Shield Gemma) to investigate the extent to which these systems flag the content of real-life therapy sessions as undesirable. Our results raise implications for the limitations that users and organizations may encounter when designing LLMs to play the part of a therapist.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper performs an algorithm audit of three content moderation systems (OpenAI moderation endpoint, Llama Guard, Shield Gemma) applied to transcripts of real therapy sessions. It claims that these systems flag therapy content as undesirable at rates that may limit LLMs' capacity to serve as therapists by preventing discussion of sensitive topics.

Significance. If the audit were shown to use a representative sample and if flagging rates were linked to actual LLM refusal behavior or therapeutic outcomes, the result would usefully document a concrete tension between safety guardrails and therapeutic utility in HCI and AI ethics. The current manuscript supplies no such linkage or quantitative details.

major comments (2)
  1. [Abstract] Abstract: the audit is described as having been performed, yet no sample size, selection criteria for therapy transcripts, quantitative flagging rates per system, or error analysis is supplied. This omission makes the data-to-claim link impossible to evaluate and is load-bearing for the central empirical claim.
  2. [Results / Discussion] Results / Discussion: the manuscript measures only static classification of existing transcripts by moderation endpoints. It supplies no measurements of how such flags are applied during live multi-turn LLM generation, whether the underlying models refuse or hedge, or any effect on therapeutic alliance, disclosure, or outcome metrics. Without these links the observed flagging cannot be translated into a capacity limitation for therapy.
minor comments (1)
  1. Add a table or figure summarizing flagging rates across the three systems and any available session metadata to improve clarity of the audit results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our algorithm audit of content moderation systems applied to therapy transcripts. We address each major comment below, clarifying the manuscript's scope while making targeted revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the audit is described as having been performed, yet no sample size, selection criteria for therapy transcripts, quantitative flagging rates per system, or error analysis is supplied. This omission makes the data-to-claim link impossible to evaluate and is load-bearing for the central empirical claim.

    Authors: We agree that the abstract would be strengthened by including these quantitative details from the results section. The manuscript reports a sample of real therapy transcripts selected according to explicit criteria (publicly available sessions focused on emotional support topics), along with per-system flagging rates and an error analysis of false positives on sensitive but clinically relevant content. We will revise the abstract to incorporate sample size, selection criteria, key flagging rates, and a summary of the error analysis. revision: yes

  2. Referee: [Results / Discussion] Results / Discussion: the manuscript measures only static classification of existing transcripts by moderation endpoints. It supplies no measurements of how such flags are applied during live multi-turn LLM generation, whether the underlying models refuse or hedge, or any effect on therapeutic alliance, disclosure, or outcome metrics. Without these links the observed flagging cannot be translated into a capacity limitation for therapy.

    Authors: The study is designed as a static algorithm audit of three moderation endpoints on authentic therapy transcripts, which directly documents how these systems classify real therapeutic content. This provides evidence of potential constraints without requiring live deployment. We do not measure or claim direct effects on multi-turn generation, refusal behavior, or clinical outcomes, as those would necessitate separate experiments involving actual LLM-therapist interactions. The discussion frames the findings as raising implications for LLM design rather than establishing causal links to therapeutic alliance or outcomes. revision: no

Circularity Check

0 steps flagged

Empirical audit with no derivations or self-referential predictions

full rationale

The paper performs a direct empirical audit by applying three external moderation endpoints (OpenAI, Llama Guard, Shield Gemma) to real therapy transcripts and reporting flagging rates. No equations, fitted parameters, predictions, or self-citations are used to derive results; the central claim rests on observed outputs from independent systems against external data. This structure contains no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the study is an empirical audit relying on existing moderation APIs and therapy transcripts as inputs.

pith-pipeline@v0.9.1-grok · 5672 in / 895 out tokens · 34028 ms · 2026-06-29T21:00:00.350003+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    AI, M. 2024. Llama Guard 3: Safety Classifier and Moderation Model. https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/. Accessed: 2025-10-12

  4. [4]

    Bandy, J. 2021. Problematic Machine Behavior: A Systematic Literature Review of Algorithm Audits. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1): 1--34

  5. [5]

    P.; and Lucero, A

    Chen, Z.; Lu, Y.; Nieminen, M. P.; and Lucero, A. 2020. Creating a Chatbot for and with Migrants: Chatbot Personality Drives Co-Design Activities. In Proceedings of the 2020 ACM Designing Interactive Systems Conference, 219--230

  6. [6]

    De Choudhury, M.; and De, S. 2014. Mental Health Discourse on Reddit: Self-Disclosure, Social Support, and Anonymity. In Eighth International AAAI Conference on Weblogs and Social Media

  7. [7]

    De Choudhury, S

    De Choudhury, M.; Pendse, S. R.; and Kumar, N. 2023. Benefits and harms of large language models in digital mental health. arXiv preprint arXiv:2311.14693

  8. [8]

    eClinicalMedicine. 2023. The epidemic of loneliness. eClinicalMedicine, 66: 102395

  9. [9]

    Fang, A.; Chhabria, H.; Maram, A.; and Zhu, H. 2025. Practicing Stress Relief for the Everyday: Designing Social Simulation Using VR, AR, and LLMs. arXiv:2410.01672

  10. [10]

    J.; Rodriguez, V

    Goel, D.; Lee, J.; Zhong, Q. J.; Rodriguez, V. J.; Brown, D. S.; Karkar, R.; Yoo, D. W.; and Saha, K. 2026. RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions. In Findings of the Association for Computational Linguistics (ACL)

  11. [11]

    Google. 2024. ShieldGemma Model Card. https://ai.google.dev/gemma/docs/shieldgemma/model_card_2. Accessed: 2025-10-12

  12. [12]

    Goyal, A.; Zhan, X.; Chen, Y.; Saha, K.; and Chandrasekharan, E. 2025. Momoe: Mixture of moderation experts framework for ai-assisted online governance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 12656--12671

  13. [13]

    This App Said I Had Severe Depression, and Now I Don’t Know What to Do

    Kang, R. M.; and Reynolds, T. L. 2024. "This App Said I Had Severe Depression, and Now I Don’t Know What to Do": The Unintentional Harms of Mental Health Applications. In Proceedings of the CHI Conference on Human Factors in Computing Systems, 1--17

  14. [14]

    J.; Yoo, D

    Kim, J.; Rodriguez, V. J.; Yoo, D. W.; Chandrasekharan, E.; and Saha, K. 2026. PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support. arXiv preprint arXiv:2601.12754

  15. [15]

    Kolla, M.; Salunkhe, S.; Chandrasekharan, E.; and Saha, K. 2024. Llm-mod: Can large language models assist content moderation? In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1--8

  16. [16]

    A.; and Durumeric, Z

    Kumar, D.; AbuHashem, Y. A.; and Durumeric, Z. 2024. Watch your language: Investigating content moderation with large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, 865--878

  17. [17]

    Li, J.; Zhu, Z.; Zhang, R.; and Lee, Y.-C. 2025 a . Exploring the Effects of Chatbot Anthropomorphism and Human Empathy on Human Prosocial Behavior Toward Chatbots. Proc. ACM Hum.-Comput. Interact., 9(7)

  18. [18]

    Li, Y.; Yao, J.; Bunyi, J. B. S.; Frank, A. C.; Hwang, A.; and Liu, R. 2025 b . CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering. arXiv preprint arXiv:2506.08584

  19. [19]

    Mahomed, Y.; et al. 2024. Auditing GPT’s Content Moderation Guardrails: Can ChatGPT Write Your Favorite TV Show? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 660--686. Rio de Janeiro, Brazil

  20. [20]

    Markov, T.; Zhang, C.; Agarwal, S.; Eloundou, T.; Lee, T.; Adler, S.; Jiang, A.; and Weng, L. 2023. A Holistic Approach to Undesired Content Detection in the Real World. arXiv:2208.03274

  21. [21]

    R.; and Rollnick, S

    Miller, W. R.; and Rollnick, S. 2013. Motivational Interviewing: Helping People Change. New York, NY, USA: Guilford Press, 3rd edition

  22. [22]

    V.; Fan, Y.; Shao, Q.; You, H.; Preindl, M.; and Jiang, X

    Nie, J.; Shao, H. V.; Fan, Y.; Shao, Q.; You, H.; Preindl, M.; and Jiang, X. 2025. LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart Devices. ACM Transactions on Computing for Healthcare. Just Accepted

  23. [23]

    NIH . 2024. Mental Illness. https://www.nimh.nih.gov/health/statistics/mental-illness. Accessed: 2025-10-11

  24. [24]

    OpenAI . 2024. Model Optimization Guide. https://platform.openai.com/docs/guides/model-optimization. Accessed: 2025-10-11

  25. [25]

    OpenAI. 2025. OpenAI omni-moderation-latest Model Documentation. https://platform.openai.com/docs/models/omni-moderation-latest

  26. [26]

    P \'e rez-Rosas, V.; Wu, X.; Resnicow, K.; and Mihalcea, R. 2019. What makes a good counselor? learning to distinguish between high-quality and low-quality counseling conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 926--935

  27. [27]

    T.; Nabizadeh, A.; and Selek, S

    Pham, K. T.; Nabizadeh, A.; and Selek, S. 2022. Artificial Intelligence and Chatbots in Psychiatry. Psychiatric Quarterly, 93(1): 249--253

  28. [28]

    Saha, K.; Jain, Y.; Liu, C.; Kaliappan, S.; and Karkar, R. 2025. AI vs. Humans for Online Support: Comparing the Language of Responses from LLMs and Online Communities of Alzheimer's Disease. ACM Transactions on Computing for Healthcare

  29. [29]

    W.; Nguyen, T.; and Althoff, T

    Sharma, A.; Rushton, K.; Lin, I. W.; Nguyen, T.; and Althoff, T. 2024. Facilitating self-guided mental health interventions through human-language model interaction: A case study of cognitive restructuring. In Proc. CHI

  30. [30]

    M.; Yoo, D

    Shi, J. M.; Yoo, D. W.; Wang, K.; Rodriguez, V. J.; Karkar, R.; and Saha, K. 2026. Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer's and Dementia Caregivers. ACM Transactions on Computing for Healthcare

  31. [31]

    Song, M.; Kim, H.; Kim, J.; Shin, S.; and Son, S. 2025. Refusal Is Not an Option: Unlearning Safety Alignment of Large Language Models. In 34th USENIX Security Symposium (USENIX Security 25), 319--338

  32. [32]

    L.; Scorza, P.; Shultz, J

    Wainberg, M. L.; Scorza, P.; Shultz, J. M.; Helpman, L.; Mootz, J. J.; Johnson, K. A.; Neria, Y.; Bradford, J.-M. E.; Oquendo, M. A.; and Arbuckle, M. R. 2017. Challenges and Opportunities in Global Mental Health: A Research-to-Practice Perspective. Current Psychiatry Reports, 19: 1--10

  33. [33]

    W.; Shi, J

    Yoo, D. W.; Shi, J. M.; Rodriguez, V. J.; and Saha, K. 2026. AI Chatbots for Mental Health Self-Management: Lived Experience--Centered Qualitative Study. JMIR Mental Health, 13: e78288

  34. [34]

    Yuan, Y.; Zhang, J.; Aledavood, T.; Zhang, R.; and Saha, K. 2026. Mental Health Impacts of AI Companions: Triangulating Social Media Quasi-Experiments, User Perspectives, and Relational Theory. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

  35. [35]

    Zhan, X.; Goyal, A.; Chen, Y.; Chandrasekharan, E.; and Saha, K. 2025. SLM-mod: Small language models surpass LLMs at content moderation. In Proc. NAACL