arxiv: 2605.00275 · v1 · submitted 2026-04-30 · 💻 cs.HC

Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes

Emma C. Wolfe, Ting Su, Olivier Tieleman, Thomas D. Hull, Matteo Malgaroli, Caitlin A. Stamatis

Pith reviewed 2026-05-09 19:37 UTC · model grok-4.3

classification 💻 cs.HC

keywords AI chatbot · mental health · engagement phenotypes · dose-response · depression · anxiety · working alliance · k-means clustering

The pith

Users of an AI mental health chatbot fall into five engagement patterns that link to different levels of depression and anxiety relief.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to demonstrate that engagement with AI mental health chatbots cannot be reduced to simple metrics like total sessions, but instead consists of distinct behavioral patterns that relate differently to clinical results. Analyzing usage data from over 100,000 users reveals five such patterns, with more intensive ones showing stronger ties to symptom improvement. A dose-response gradient for depression relief appears in both direct reports and model predictions from a much larger group, while working alliance with the chatbot adds predictive value for outcomes. If these links hold, app developers would need to prioritize shaping specific usage styles over maximizing raw engagement volume.

Core claim

K-means clustering on eight behavioral features from 102,684 users of the Ash AI mental health chatbot identified five engagement phenotypes: Early Dropouts (52.2 percent), Power Users (1.6 percent), Intensive Users (4.1 percent), Weekly Users (25.3 percent), and Concentrated Users (16.8 percent). Significant pre-to-post reductions occurred in depression and anxiety scores, with a dose-response pattern for depression improvement that replicated when using model-predicted PHQ-9 values across 23,813 users. Higher working alliance scores predicted greater depression gains and moderated the relationship between engagement and social support increases.

What carries the argument

K-means clustering across eight behavioral usage features to derive distinct engagement phenotypes and their ties to clinical measures.
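
As a rough illustration of the clustering step, the following is a minimal sketch of Lloyd's k-means on standardized features. It is not the authors' pipeline: the review does not list the eight features or say whether they were standardized, so the two columns in `users` (sessions per week, messages per session), the z-scoring, and all function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]


def sq_dist(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random distinct starting points
    labels = None
    for _ in range(iters):
        # Assignment step: each user goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: (sessions per week, messages per session).
users = [(1, 3), (1, 4), (2, 3), (20, 40), (22, 38), (21, 41)]
labels, centroids = kmeans(standardize(users), k=2)
print(labels)
```

On this toy data the three light users and the three heavy users end up in separate clusters; which cluster gets which label number depends on the seed.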

If this is right

Different clinical outcomes respond to different dimensions of chatbot engagement, so depression relief follows a usage-intensity gradient while social support gains show separate patterns.
Total session counts alone fail to capture meaningful variation in user behavior and should not serve as the primary engagement metric.
Working alliance with the chatbot independently predicts depression improvement and alters how engagement affects social support.
Model-predicted clinical scores can reliably extend outcome analysis from small survey subsamples to tens of thousands of additional users.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Chatbot interfaces could include targeted prompts to encourage concentrated or intensive usage patterns that align with stronger outcomes.
Similar phenotype clusters may appear in other conversational health tools, pointing toward pattern-based rather than frequency-based personalization strategies.
Randomized tests could check whether shifting users into higher-benefit phenotypes produces measurable clinical gains beyond natural usage.

Load-bearing premise

The small subset of users who completed clinical questionnaires must accurately represent the full user base, and changes in self-reported or model-predicted symptom scores must reflect genuine clinical improvement without major selection or reporting biases.

What would settle it

A study that measures actual clinical outcomes through independent assessments in a sample where all users complete follow-ups and finds no difference in improvement across the five engagement phenotypes would falsify the dose-response associations.

Figures

Figures reproduced from arXiv: 2605.00275 by Caitlin A. Stamatis, Emma C. Wolfe, Matteo Malgaroli, Olivier Tieleman, Thomas D. Hull, Ting Su.

**Figure 4.** Change in predicted PHQ-9 score by cluster (n=23,813). Sensitivity Analysis: 21-Day Engagement Features as Predictors of Self-Reported Clinical Change. Controlling for baseline PHQ-9, significantly lower week-3 PHQ-9 scores were associated with total message volume (β = -0.14, SE = 0.05, sr² = 0.031, p = 0.009), average messages per session (β = -0.12, SE = 0.05, sr² = 0.023, p = 0.022), and total active me…

Original abstract

Background: Conversational AI chatbots are emerging as scalable mental health tools, but little is known about real world engagement or its relationship to clinical outcomes. Objective: To characterize engagement phenotypes among users of Ash, a purpose-built AI mental health chatbot, and examine associations with clinical change and working alliance. Methods: K-means clustering across eight behavioral features identified engagement phenotypes among 102,684 users. Subsamples completed the PHQ-9 (n=298), GAD-7 (n=298), and MSPSS (social support; n=194) baseline and 3 weeks; 11,437 users completed baseline Working Alliance Inventory (WAI). Results: Five engagement phenotypes emerged: Early Dropouts (52.2%), Power Users (1.6%), Intensive Users (4.1%), Weekly Users (25.3%), and a novel Concentrated User pattern (16.8%); across users, 66.9% had at least one overnight session (9pm-5am). Significant pre-post improvements occurred in depression (d = -0.51), anxiety (d = -0.57), and social support (d = 0.22). An observed dose-response gradient in self-reported depression improvement was replicated in a larger sample with model-predicted PHQ-9 (n = 23,813; Power Users d = -0.54; Early Dropouts d = -0.13). Higher working alliance predicted depression improvement and moderated the engagement-social support relationship. Conclusions: Engagement with AI mental health tools is multidimensional, and different clinical outcomes respond to different dimensions of use. Findings caution against treating session counts as a primary engagement metric and offer naturalistic evidence for the clinical value of purpose-built conversational AI.
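
The abstract reports pre-post effect sizes as Cohen's d but does not say which variant was used. The sketch below shows two common choices for paired data, on made-up PHQ-9 scores: `d_z` divides the mean change by the standard deviation of the change scores, while `d_baseline` divides it by the baseline standard deviation. The function names and the toy scores are hypothetical, and neither function is claimed to reproduce the paper's figures.

```python
from statistics import mean, stdev


def d_z(pre, post):
    """Mean change divided by the SD of the paired differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


def d_baseline(pre, post):
    """Mean change divided by the SD of the baseline scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(pre)


# Hypothetical PHQ-9 scores for five users at baseline and week 3.
pre = [14, 10, 18, 12, 16]
post = [10, 9, 13, 11, 12]

print(round(d_z(pre, post), 2), round(d_baseline(pre, post), 2))
```

Both values are negative here because scores fell, but they differ in magnitude, which is why the choice of denominator should be stated alongside any reported d.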

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper maps five engagement patterns in over 100k AI chatbot users and links them to depression changes via self-report plus model predictions, but the outcome side is thin.

The core contribution is a k-means clustering of 102,684 users on eight behavioral features that yields five phenotypes: Early Dropouts, Power Users, Intensive Users, Weekly Users, and a new Concentrated User group that packs sessions into short bursts. They also report pre-post drops in PHQ-9 and GAD-7 in a small subsample and then try to extend the dose-response pattern to 23k users with a model-predicted PHQ-9. The scale of the engagement data is the real strength here; most prior chatbot studies are tiny, so seeing overnight sessions in two-thirds of users and the Concentrated pattern is useful descriptive work. The working alliance findings add a bit more texture on how engagement relates to perceived support.

The soft spots are straightforward. Only 298 users supplied actual clinical scores, and nothing in the abstract shows they match the full cohort on severity or motivation, so selection bias is a live concern. Extending the gradient with predicted scores is fine in principle, but the abstract does not clarify whether the prediction model was trained on the same behavioral features used for clustering or on an independent hold-out; if the two overlap, the replication is weaker than it looks. Everything stays observational, so claims about clinical value stay correlational.

This is worth reading for anyone building or studying real-world AI mental health tools who needs better engagement metrics than raw session counts. It is not ready to change practice, but the phenotypes and the large behavioral sample give it enough substance that a serious editor should send it out for review rather than desk-reject. Ask the authors for the exact feature list, the prediction model details, and any checks on subsample representativeness.

Referee Report

2 major / 2 minor

Summary. The paper applies k-means clustering to eight behavioral features from 102,684 users of the Ash AI mental health chatbot, identifying five engagement phenotypes (Early Dropouts 52.2%, Power Users 1.6%, Intensive Users 4.1%, Weekly Users 25.3%, Concentrated Users 16.8%). It reports pre-post clinical improvements (PHQ-9 d=-0.51, GAD-7 d=-0.57) in a subsample of n=298 and replicates a dose-response gradient in depression improvement via model-predicted PHQ-9 scores in n=23,813 users, while also examining working alliance (WAI) in 11,437 users and concluding that engagement is multidimensional with differential outcome associations.

Significance. If the central associations hold after addressing selection and prediction issues, the work offers large-scale naturalistic evidence on real-world patterns of AI chatbot engagement and their links to mental health outcomes. Strengths include the scale of the clustering analysis and the explicit caution against relying solely on session counts; the dose-response replication attempt and working-alliance moderation findings could inform chatbot design if the model predictions prove independent of the clustering features.

major comments (2)

[Methods and Results on model-predicted PHQ-9] The replication of the dose-response gradient in depression improvement (Power Users d=-0.54 vs. Early Dropouts d=-0.13) relies on model-predicted PHQ-9 scores for n=23,813 users. The manuscript must specify the training data, features, and validation procedure for this prediction model (Methods section on outcome modeling). If the model was trained using the same eight behavioral engagement features as the k-means clustering or on the n=298 clinical subsample without proper hold-out, the larger-sample gradient is not an independent replication and risks circularity with the phenotype definitions.
[Results on clinical subsamples and dose-response] The clinical outcome analyses rest on a small subsample (n=298 for PHQ-9/GAD-7) drawn from 102,684 users. The paper should report a direct comparison of baseline engagement metrics, demographics, and phenotype distributions between clinical completers and non-completers (Results section on sample characteristics) to evaluate selection bias. Without this, the assumption that the observed dose-response generalizes is not supported and undermines the claim that different engagement dimensions produce differential clinical responses.
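
One way to run the completer versus non-completer check requested in the second major comment is a chi-square goodness-of-fit test of the completers' phenotype counts against the full-cohort shares. The sketch below uses the cohort percentages reported in the paper; the completer counts are invented for illustration, and the function name is hypothetical. Because the 298 completers are a tiny fraction of the cohort, treating the cohort shares as fixed expected proportions is a reasonable approximation.

```python
# Phenotype shares in the full cohort, as reported in the paper.
COHORT_SHARE = {
    "Early Dropouts": 0.522,
    "Power Users": 0.016,
    "Intensive Users": 0.041,
    "Weekly Users": 0.253,
    "Concentrated Users": 0.168,
}

# HYPOTHETICAL phenotype counts among the n=298 clinical completers.
completers = {
    "Early Dropouts": 60,
    "Power Users": 30,
    "Intensive Users": 48,
    "Weekly Users": 100,
    "Concentrated Users": 60,
}

# Critical value of the chi-square distribution for df = 4 at alpha = 0.05.
CRITICAL_DF4 = 9.488


def chi_square_gof(observed, expected_share):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed.values())
    return sum(
        (observed[k] - n * share) ** 2 / (n * share)
        for k, share in expected_share.items()
    )


stat = chi_square_gof(completers, COHORT_SHARE)
print(stat > CRITICAL_DF4)  # True means the completers' mix differs from the cohort's
```

A statistic above the critical value would indicate that completers are not distributed across phenotypes like the full cohort, which is the selection concern the referee raises.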

minor comments (2)

[Methods on k-means clustering] The choice of k=5 clusters is presented without reported justification such as an elbow plot, silhouette analysis, or stability checks across random seeds; add this to the Methods section on clustering to allow readers to assess sensitivity of the phenotype definitions.
[Results on engagement phenotypes] A table or figure presenting the eight behavioral features and their means per phenotype would improve interpretability of the 'Concentrated User' pattern; currently the abstract and text leave the distinguishing characteristics of this novel phenotype underspecified.
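
The silhouette analysis mentioned in the first minor comment can be sketched as follows. This is a plain-Python mean silhouette score applied to two candidate labelings of toy points, not an analysis of the paper's data; the points, the labelings, and the function name are hypothetical. In practice the score would be computed for each candidate k (and across random seeds) and the values compared.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Two tight, well-separated groups of hypothetical users.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels_k2 = [0, 0, 0, 1, 1, 1]  # the natural two-group split
labels_k3 = [0, 0, 1, 2, 2, 2]  # over-splits the first group

print(silhouette_score(points, labels_k2) > silhouette_score(points, labels_k3))
```

The over-split labeling scores lower because points in the divided group sit about as close to the neighboring cluster as to their own.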

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's careful reading and valuable suggestions. We will revise the manuscript to address the concerns regarding the prediction model details and potential selection bias in the clinical subsample.

Referee: The replication of the dose-response gradient in depression improvement (Power Users d=-0.54 vs. Early Dropouts d=-0.13) relies on model-predicted PHQ-9 scores for n=23,813 users. The manuscript must specify the training data, features, and validation procedure for this prediction model (Methods section on outcome modeling). If the model was trained using the same eight behavioral engagement features as the k-means clustering or on the n=298 clinical subsample without proper hold-out, the larger-sample gradient is not an independent replication and risks circularity with the phenotype definitions.

Authors: We will revise the Methods section to provide a complete description of the PHQ-9 prediction model, including the training dataset (separate from both the main clustering sample and the clinical subsample), the input features (which do not include the eight behavioral engagement features used for clustering), and the validation approach (with appropriate hold-out procedures). This will confirm that the dose-response analysis in the larger sample is an independent replication and not subject to circularity. revision: yes
Referee: The clinical outcome analyses rest on a small subsample (n=298 for PHQ-9/GAD-7) drawn from 102,684 users. The paper should report a direct comparison of baseline engagement metrics, demographics, and phenotype distributions between clinical completers and non-completers (Results section on sample characteristics) to evaluate selection bias. Without this, the assumption that the observed dose-response generalizes is not supported and undermines the claim that different engagement dimensions produce differential clinical responses.

Authors: We agree that a comparison between clinical completers and non-completers is necessary to assess selection bias. In the revised Results section on sample characteristics, we will include a direct comparison of baseline engagement metrics, demographics, and phenotype distributions for the n=298 users versus the remaining users. This addition will allow readers to better evaluate the generalizability of the findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core analysis uses unsupervised k-means clustering on eight behavioral features to identify five engagement phenotypes in the full 102,684-user cohort, followed by direct pre-post clinical outcome measurements in a small subsample (n=298 for PHQ-9/GAD-7) and a separate supervised model to predict PHQ-9 scores for a larger group (n=23,813). No step reduces by construction to its inputs: the phenotypes are defined independently of the clinical outcomes, the observed dose-response is measured directly in the subsample, and the model-predicted extension applies a fitted mapping to new users without tautologically reproducing the clustering or the small-sample gradient. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation remains self-contained observational analysis.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on standard clustering assumptions and the validity of self-report scales in a self-selected digital sample; no new physical entities or first-principles derivations are introduced.

free parameters (1)

number of clusters k
k=5 chosen for k-means; abstract does not detail selection criterion such as elbow method or silhouette score.

axioms (2)

domain assumption: K-means clustering produces meaningful, separable groups from the chosen behavioral features.
Invoked when interpreting the five phenotypes as distinct engagement patterns.
domain assumption: Pre-post changes in PHQ-9, GAD-7, and MSPSS reflect true clinical improvement rather than regression to the mean or reporting bias.
Central to interpreting dose-response gradients.

invented entities (1)

engagement phenotypes · no independent evidence
purpose: Categorize multidimensional usage patterns
Derived directly from clustering; no independent falsifiable prediction provided beyond the data itself.
