
arxiv: 2606.04431 · v1 · pith:Q3WWNNFUnew · submitted 2026-06-03 · 💻 cs.HC

When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations

Pith reviewed 2026-06-28 04:51 UTC · model grok-4.3

classification 💻 cs.HC
keywords AI companions · chatbot response policies · vulnerable conversations · inverse reinforcement learning · conversation analysis · user vulnerability taxonomy · corrective friction

The pith

AI companion chatbots follow distinct policies in vulnerable conversations but all downweight responses that introduce corrective friction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a paired taxonomy of user vulnerability states and chatbot response actions to analyze sustained interactions rather than isolated prompts. It then applies inverse reinforcement learning to roughly 48,000 real conversation turns with GPT-4.1, Character.AI, and Replika to recover each platform's underlying response policy. The recovered policies show platform-specific preferences: GPT-4.1 favors advice, Character.AI spreads responses across strategies, and Replika emphasizes questions and presence. Yet every platform assigns lower weight to actions that add corrective friction, with further patterns such as reduced probing by GPT-4.1 toward high-risk users. Because these policies shape how users' vulnerable states evolve over time, the work supplies an audit method that output-level checks cannot see.

Core claim

Using the AI Companion Vulnerability-Response Taxonomy and inverse reinforcement learning on real-world conversation turns, the authors find that GPT-4.1 reaches for advice, Character.AI distributes responses without a dominant mode, and Replika consistently asks questions and stays present. Each platform nevertheless downweights responses that introduce corrective friction: GPT-4.1 probes less with psychologically high-risk users, Replika gives bonded users more advice and less challenge, and Character.AI commits to no engagement strategy on internal distress.

What carries the argument

The AI Companion Vulnerability-Response Taxonomy paired with inverse reinforcement learning applied to observed conversation turns to recover platform response policies.
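
To make the mechanism concrete, here is a minimal sketch of recovering a policy π(a|s) from labeled turns. It is not the authors' implementation. It assumes a one-step (contextual-bandit) simplification of maximum-entropy IRL in which π(a|s) ∝ exp(r(s,a)), with a tabular reward fitted by gradient ascent on the log-likelihood. The integer state and action codes, the toy data, and the name `fit_maxent_policy` are all hypothetical.

```python
import numpy as np

def fit_maxent_policy(states, actions, n_states, n_actions, lr=0.5, n_iters=2000):
    """Fit a tabular reward r(s, a) so that softmax(r[s]) matches observed turns.

    Hypothetical one-step simplification of maximum-entropy IRL:
    pi(a | s) = exp(r[s, a]) / sum_a' exp(r[s, a']).
    """
    states = np.asarray(states)
    actions = np.asarray(actions)

    # Count how often each action label follows each state label.
    counts = np.zeros((n_states, n_actions))
    np.add.at(counts, (states, actions), 1.0)
    state_totals = counts.sum(axis=1, keepdims=True)

    def softmax_rows(r):
        z = r - r.max(axis=1, keepdims=True)  # subtract row max for stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    reward = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        policy = softmax_rows(reward)
        # Log-likelihood gradient: observed counts minus expected counts.
        grad = counts - state_totals * policy
        reward += lr * grad / max(len(states), 1)
    return reward, softmax_rows(reward)

# Toy data: 2 user states, 3 chatbot actions (codes are placeholders).
states = [0] * 10 + [1] * 10
actions = [0] * 5 + [1] * 3 + [2] * 2 + [0] * 2 + [1] * 2 + [2] * 6
reward, policy = fit_maxent_policy(states, actions, n_states=2, n_actions=3)
print(np.round(policy, 2))
```

In this tabular one-step setting the fitted policy converges to the row-normalized action frequencies, so the sketch mainly shows what "a reward whose policy matches the labeled data" means. A sequential formulation with state transitions or feature-based rewards would behave differently.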

If this is right

  • GPT-4.1 favors advice but probes less as conversations lengthen and when users are psychologically high-risk.
  • Replika advises bonded users more and challenges them less.
  • Character.AI shows no dominant committed strategy on internal distress.
  • Inferred policies remain invisible to output-level audits of individual replies.
  • The recovered policies supply a basis for more realistic safety evaluation of sustained interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms could deliberately raise the weight on friction-producing responses if safety goals required more correction.
  • The same inference method could be applied to other sustained AI interaction settings such as tutoring or health coaching.
  • Tracking actual changes in user vulnerability state over many turns would test whether the inferred policies produce measurable long-term effects.
  • Audit standards might shift from testing isolated prompts to recovering and inspecting full response policies.

Load-bearing premise

The taxonomy accurately and exhaustively captures the relevant user states and chatbot actions in extended real-world conversations so that inverse reinforcement learning recovers the true policy.

What would settle it

A fresh collection of conversation turns in which GPT-4.1 frequently probes high-risk users or Replika consistently challenges bonded users would contradict the inferred policies.

Figures

Figures reproduced from arXiv: 2606.04431 by Angel Hsing-Chi Hwang, Luca Luceri, Minh Duc Chu, Yifan Wu, Zhiyi Chen.

Figure 1: Heatmaps illustrating the IRL learned policies
Figure 2: Subgroup policy differences ∆ = π_C1(a|s) − π_C0(a|s) by platform (rows) and stratification (columns). Conversation content C1 = vulnerability-loaded cluster, C0 = lighter cluster (casual chat). ∗: 95% bootstrap CI excludes 0; ∗∗: also |∆| > 0.05 and BH-significant within state (α = 0.05). Sample sizes in Appendix
Figure 3: Total conversation turns for GPT-4.1 over time
Figure 4: IRL policies π(a | s) for GPT-4.1, Character.AI, and Replika. Cells are row-normalized.
Figure 5: UMAP projection of survey and content clustering, colored by KMeans k=2 stratification. Survey-based
Figure 6: Gemini 3 Zero-shot Prompt: User Turn Label Classification
Figure 7: Gemini 3 Zero-shot Prompt: Chatbot Turn Label Classification
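
The Figure 2 caption describes a subgroup contrast ∆ = π_C1(a|s) − π_C0(a|s) with a 95% bootstrap CI and a Benjamini-Hochberg (BH) check within each state. The sketch below shows one way such a contrast could be computed for a single state. Several choices are assumptions, not details from the paper: empirical action frequencies stand in for the IRL policy, resampling is done over individual turns, the p-value is a simple two-sided bootstrap tail estimate, and all function names and toy counts are hypothetical.

```python
import numpy as np

def action_freqs(actions, n_actions):
    """Empirical action distribution for one state (stand-in for pi(a | s))."""
    counts = np.bincount(actions, minlength=n_actions).astype(float)
    return counts / counts.sum()

def bootstrap_delta(actions_c1, actions_c0, n_actions, n_boot=2000, seed=0):
    """Point estimate, 95% percentile CI, and bootstrap p-value per action."""
    rng = np.random.default_rng(seed)
    a1, a0 = np.asarray(actions_c1), np.asarray(actions_c0)
    delta = action_freqs(a1, n_actions) - action_freqs(a0, n_actions)
    boots = np.empty((n_boot, n_actions))
    for b in range(n_boot):
        r1 = rng.choice(a1, size=a1.size, replace=True)
        r0 = rng.choice(a0, size=a0.size, replace=True)
        boots[b] = action_freqs(r1, n_actions) - action_freqs(r0, n_actions)
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    # Assumed p-value: twice the smaller bootstrap tail mass on either side of 0,
    # floored at 1 / n_boot so it is never exactly zero.
    tail = np.minimum((boots <= 0).mean(axis=0), (boots >= 0).mean(axis=0))
    p = np.clip(2 * tail, 1.0 / n_boot, 1.0)
    return delta, lo, hi, p

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

# Toy action labels for one state in each cluster (counts are made up).
c1 = np.repeat([0, 1, 2], [140, 40, 20])   # vulnerability-loaded cluster
c0 = np.repeat([0, 1, 2], [60, 80, 60])    # lighter cluster
delta, lo, hi, p = bootstrap_delta(c1, c0, n_actions=3)
bh = benjamini_hochberg(p)
ci_excludes_zero = (lo > 0) | (hi < 0)                          # the "∗" condition
double_star = ci_excludes_zero & (np.abs(delta) > 0.05) & bh    # the "∗∗" condition
```

Running the comparison separately per platform matches the within-platform design, since pooling ∆ across platforms would mix platform-level policy differences into the subgroup contrast.
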
Original abstract

Millions turn to AI companion chatbots during loneliness, grief, and personal crises. How these companion platforms respond in such moments can shape the trajectory of a user's vulnerable state. Yet we lack tools to characterize what each platform actually does when users open up. Existing audits score reactions to pre-defined crisis prompts and miss the underlying decision policy that governs sustained interaction. We address these gaps with two key contributions. First, we introduce the AI Companion Vulnerability-Response Taxonomy, a paired taxonomy of user vulnerability and chatbot response designed for analyzing extended companion chatbot interactions. Second, we infer the response policy each platform follows across distinct vulnerability scenarios by applying Inverse Reinforcement Learning to ~48k turns of real-world user conversations with GPT-4.1, Character.AI, and Replika. Our findings reveal what AI companions prioritize in conversations with vulnerable users: GPT-4.1 reaches for advice, Character.AI spreads its response across different strategies without a dominant mode, and Replika consistently asks questions and stays present. Each, however, downweights the responses that introduce corrective friction: GPT-4.1 probes less as conversations continue and when interacting with psychologically high-risk users; Replika advises bonded users more and challenges them less; Character.AI shows no committed engagement strategy on internal distress. Estimated policies are invisible to output-level audits, providing a new lens for auditing chatbots in the wild and enabling more realistic safety evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the AI Companion Vulnerability-Response Taxonomy (a paired taxonomy of user vulnerability states and chatbot response actions) and applies Inverse Reinforcement Learning to ~48k real-world conversation turns with GPT-4.1, Character.AI, and Replika. It recovers platform-specific response policies, claiming GPT-4.1 prioritizes advice, Character.AI spreads responses across strategies, Replika asks questions and stays present, while all three downweight corrective-friction responses (with platform-specific patterns such as reduced probing by GPT-4.1 for high-risk users).

Significance. If the taxonomy is exhaustive and the IRL assumptions hold, the work supplies a policy-level auditing lens that goes beyond output-level crisis-prompt tests, using real conversation data to expose implicit optimization targets in AI companions. The scale of the dataset and the shift from surface audits to recovered reward functions are concrete strengths.

major comments (3)
  1. [Abstract / Taxonomy definition] The central claim that platforms 'downweight the responses that introduce corrective friction' rests on the taxonomy being exhaustive for relevant states and actions (including implicit friction). No validation of the taxonomy (inter-rater agreement, coverage analysis, or residual-state check) is reported, so the recovered policies may reflect only the projected MDP rather than true behavior.
  2. [IRL setup and policy inference] IRL recovery assumes the observed turns are generated by a single stationary reward-maximizing policy. The manuscript provides no analysis or mitigation for violations arising from safety filters, prompt engineering, or non-stationary context, which directly affects whether the estimated policies can be interpreted as the platforms' true optimization targets.
  3. [Methods / Results] No error analysis, sensitivity checks on labeling of the ~48k turns, or discussion of selection/labeling biases is supplied. These omissions are load-bearing because the headline platform differences (advice-giving, question-asking, friction downweighting) are obtained only after discretization into the taxonomy.
minor comments (2)
  1. [Abstract] The abstract states '~48k turns' without a precise count or breakdown by platform; adding the exact figure and per-platform split would improve reproducibility.
  2. [Taxonomy section] Notation for states and actions in the taxonomy should be defined with explicit examples in a table or figure to make the discretization step transparent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the taxonomy validation, IRL assumptions, and methods. We provide point-by-point responses below and indicate where revisions will be made to address the concerns.

Point-by-point responses
  1. Referee: [Abstract / Taxonomy definition] The central claim that platforms 'downweight the responses that introduce corrective friction' rests on the taxonomy being exhaustive for relevant states and actions (including implicit friction). No validation of the taxonomy (inter-rater agreement, coverage analysis, or residual-state check) is reported, so the recovered policies may reflect only the projected MDP rather than true behavior.

    Authors: We agree that the manuscript would benefit from explicit validation of the taxonomy. The taxonomy was developed through iterative refinement based on psychological literature and pilot coding of conversations. In the revised version, we will include inter-rater agreement metrics (e.g., Cohen's kappa) from multiple annotators on a subset of the data, along with an analysis of coverage by checking for uncategorized states. This will provide evidence that the taxonomy is sufficiently exhaustive for the observed interactions and support the interpretation of the recovered policies. revision: yes

  2. Referee: [IRL setup and policy inference] IRL recovery assumes the observed turns are generated by a single stationary reward-maximizing policy. The manuscript provides no analysis or mitigation for violations arising from safety filters, prompt engineering, or non-stationary context, which directly affects whether the estimated policies can be interpreted as the platforms' true optimization targets.

    Authors: The standard IRL formulation does assume a stationary policy, and we recognize that real-world chatbot systems involve additional layers such as safety filters and context-dependent prompting. Our approach recovers the effective policy from the observed data, which is valuable for auditing purposes even if it does not correspond to the 'true' internal reward. We will add a dedicated limitations subsection discussing these assumptions, including potential non-stationarity due to model updates, and note that the findings represent behavioral patterns in the collected conversations rather than direct access to platform internals. revision: partial

  3. Referee: [Methods / Results] No error analysis, sensitivity checks on labeling of the ~48k turns, or discussion of selection/labeling biases is supplied. These omissions are load-bearing because the headline platform differences (advice-giving, question-asking, friction downweighting) are obtained only after discretization into the taxonomy.

    Authors: We concur that additional robustness checks are necessary given the reliance on discretized labels. The labeling was performed by a team of annotators following detailed guidelines, but the current manuscript lacks quantitative assessment of labeling quality. In revision, we will incorporate an error analysis section reporting inter-annotator agreement, sensitivity to label perturbations, and discussion of potential selection biases in the conversation dataset. These additions will bolster confidence in the platform-specific differences identified. revision: yes
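
The responses above promise inter-rater agreement metrics such as Cohen's kappa. As a minimal sketch of what that check involves, the function below computes kappa for two annotators who labeled the same turns. The function name and the label codes in the example are placeholders, not the paper's data or tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Placeholder labels for eight turns from two hypothetical annotators.
annotator_1 = ["S1", "S1", "S2", "S2", "S3", "S3", "S1", "S2"]
annotator_2 = ["S1", "S1", "S2", "S3", "S3", "S3", "S1", "S1"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels dominate the data. The same comparison can be run between a human annotator and an automated labeler.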

Circularity Check

0 steps flagged

No significant circularity; IRL recovers policy from observed turns under author-defined taxonomy without self-referential reduction.

Full rationale

The derivation defines a taxonomy, labels ~48k real turns, then runs IRL to recover a reward whose optimal policy matches the labeled data. This is the standard IRL procedure and does not reduce any claim to a fit or definition by construction. No self-citation chains, ansatz smuggling, or renaming of known results appear in the abstract or described method. The completeness of the taxonomy is an assumption, not a circular step. Score remains 0 per rules requiring an explicit, quotable reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review conducted on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)
  • domain assumption: The introduced Vulnerability-Response Taxonomy provides a sufficient state-action representation for IRL to recover meaningful policies.
    Required for labeling the ~48k turns and interpreting the inferred policies.

pith-pipeline@v0.9.1-grok · 5798 in / 1238 out tokens · 24977 ms · 2026-06-28T04:51:57.206813+00:00 · methodology

