Functional outcomes and naturalistic engagement with a purpose-built conversational AI for mental health (Ash)
Pith reviewed 2026-06-29 02:17 UTC · model grok-4.3
The pith
Users of the mental health conversational AI Ash reported small within-person gains in functioning and working alliance over four weeks, with engagement levels predicting outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In this single-arm observational cohort study, new users of Ash completed single-item measures of psychological functioning, working alliance, and grandiosity at baseline and week 4. Functioning indicators and working alliance improved significantly within persons (effect sizes of 0.14-0.26), and grandiosity did not change. Active days, sessions, and minutes consistently predicted week-4 outcomes after baseline adjustment, while message volume did not.
What carries the argument
Within-person change tested by paired-sample t-tests and engagement-outcome links tested by ANCOVAs controlling for baseline, applied to in-app single-item measures of functioning and alliance.
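As a rough illustration of the first half of that machinery, the sketch below computes a paired-sample t statistic and a standardized effect size from baseline and follow-up scores. The ratings are made up, the function name `paired_t_and_dz` is hypothetical, and the effect size shown is Cohen's d_z (mean difference divided by the SD of the differences); the abstract does not say which variant of d the authors report, so that choice is an assumption.

```python
import math
import statistics

def paired_t_and_dz(baseline, followup):
    """Return (t statistic, Cohen's d_z) for paired scores."""
    diffs = [f - b for b, f in zip(baseline, followup)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff
    return t_stat, d_z

# Hypothetical ratings for six users, invented for illustration only.
baseline = [4, 5, 3, 6, 5, 4]
week4 = [5, 5, 4, 6, 6, 5]
t_stat, d_z = paired_t_and_dz(baseline, week4)

# For d_z the identity t = d_z * sqrt(n) holds, so a small effect
# can still give a large t when n = 1,284 (assuming d is d_z).
t_if_d_014 = 0.14 * math.sqrt(1284)  # about 5.0
```

Under that assumption, even the smallest reported effect (0.14) corresponds to a t statistic near 5 at this sample size, which is why small effects and ps < .001 can coexist.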
If this is right
- Greater engagement with the AI is associated with better psychological functioning and working alliance at four weeks.
- Use of the AI does not appear to increase grandiosity or inflated self-perception.
- Conversational AI can be evaluated on day-to-day functioning outcomes in addition to symptom reduction.
- Engagement volume in sessions and minutes, but not message count, tracks with outcome gains.
Where Pith is reading between the lines
- If the engagement-outcome link holds in controlled designs, the AI could serve as a low-cost adjunct for maintaining functioning between clinical visits.
- The small effect sizes imply the tool may work best when users already show some baseline motivation to engage regularly.
- Longer-term follow-up could test whether the week-4 gains persist or require ongoing active days to maintain.
Load-bearing premise
The changes and engagement associations reflect effects of Ash use rather than regression to the mean, motivated completers, or outside events, in a design without randomization or a control condition.
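To make the regression-to-the-mean concern concrete, here is a small simulation sketch in which nobody truly changes, yet people who enroll while scoring low show an average gain at follow-up. Every number in it (trait mean, noise SD, the enrollment threshold `enroll_below`) is a hypothetical choice for illustration, not an estimate from the Ash data.

```python
import random
import statistics

random.seed(0)

def simulate_mean_change(n_people=20000, enroll_below=5.0):
    """Mean week-4 minus baseline change among people who enroll
    only when their observed baseline score is low."""
    changes = []
    for _ in range(n_people):
        trait = random.gauss(6.0, 1.5)              # stable true level
        baseline = trait + random.gauss(0.0, 1.0)   # noisy measurement
        week4 = trait + random.gauss(0.0, 1.0)      # same trait, new noise
        if baseline < enroll_below:                 # enrollment filter
            changes.append(week4 - baseline)
    return statistics.mean(changes), len(changes)

mean_change, n_enrolled = simulate_mean_change()
```

In this setup `mean_change` comes out positive even though the simulated trait never moves, which is the pattern a control arm would be needed to separate from a genuine effect.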
What would settle it
A randomized trial assigning participants to Ash versus a waitlist or sham chatbot and finding no between-group differences in the same functioning measures at week 4 would falsify the claim that Ash use drives the observed improvements.
Original abstract
Background: Conversational AI chatbots designed for mental health may offer an accessible, scalable avenue for supporting psychological well-being, yet prior evaluations have largely focused on clinical symptom reduction rather than broader indicators of day-to-day functioning, and have rarely monitored for potential harms such as inflated self-perception.
Objective: We examined within-person change in psychological functioning indicators among real-world users of Ash, a purpose-built conversational AI for mental health support, over the first four weeks of use, and whether these changes were associated with engagement metrics.
Methods: In this single-arm observational cohort study, new users (n = 1,284) completed in-app single-item measures of psychological functioning (life satisfaction, relationship satisfaction, sleep quality, behavioral activation), working alliance, and grandiosity (inflated self-perception), at baseline and Week 4. Paired-sample t-tests examined within-person change; ANCOVAs tested engagement-outcome associations at Week 4, controlling for baseline.
Results: At baseline, participants reported below-average life satisfaction and fair sleep quality. Significant within-person improvements emerged across all functioning indicators and working alliance (ps < .001; d = 0.14-0.26), with no change in grandiosity. Active days, total sessions, and total minutes consistently predicted Week 4 psychological functioning and working alliance (ps ≤ .006; partial R² range: 0.58-2.15%; controlling for baseline), whereas user message volume did not.
Conclusion: Findings provide preliminary data for the potential of evidence-based conversational AI to extend mental health support for broad psychological functioning, extending the existing literature beyond symptom-based outcomes.
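The partial R² figures in the abstract describe how much of the Week 4 variance left over after accounting for baseline is associated with an engagement metric. A minimal sketch of that quantity follows, assuming a single continuous engagement predictor and a single baseline covariate; the paper's ANCOVAs may have coded engagement differently. The helper names `residualize` and `partial_r2` and all simulated values are hypothetical.

```python
import random
import statistics

def residualize(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def partial_r2(outcome, baseline, engagement):
    """Squared partial correlation of engagement with outcome,
    with baseline removed from both."""
    res_y = residualize(outcome, baseline)
    res_e = residualize(engagement, baseline)
    cross = sum(a * b for a, b in zip(res_y, res_e))
    denom = sum(a * a for a in res_y) * sum(b * b for b in res_e)
    return cross * cross / denom

# Simulated data with a deliberately weak engagement effect (invented values).
random.seed(2)
n = 1284
baseline = [random.gauss(5.0, 2.0) for _ in range(n)]
active_days = [random.gauss(10.0, 4.0) for _ in range(n)]
week4 = [0.6 * b + 0.03 * d + random.gauss(0.0, 1.0)
         for b, d in zip(baseline, active_days)]
share = partial_r2(week4, baseline, active_days)
```

Here `share` is a proportion between 0 and 1; multiplying by 100 gives a percentage comparable in form to the 0.58-2.15% range quoted above.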
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from a single-arm observational cohort study involving 1,284 new users of the conversational AI chatbot 'Ash' designed for mental health support. Using in-app single-item measures, it finds significant within-person improvements from baseline to Week 4 in life satisfaction, relationship satisfaction, sleep quality, behavioral activation, and working alliance (ps < .001; Cohen's d = 0.14-0.26), with no change in grandiosity. Engagement metrics such as active days, total sessions, and total minutes significantly predicted Week 4 outcomes in ANCOVAs controlling for baseline (ps ≤ .006; partial R² = 0.58-2.15%), while user message volume did not. The authors conclude that these findings offer preliminary support for the use of evidence-based conversational AI to improve broad psychological functioning.
Significance. If the within-person changes and engagement associations can be causally linked to Ash use, the study would make a valuable contribution to the literature on mental health chatbots by focusing on functional outcomes rather than just symptoms and by monitoring for potential adverse effects like grandiosity. The naturalistic design with a relatively large sample and the use of appropriate statistical methods (paired t-tests and ANCOVAs) are positive aspects. However, the lack of a control group substantially limits causal claims, which tempers the overall significance.
major comments (3)
- [Methods] The single-arm observational design without a control condition or randomization (as described in the Methods section) means that the significant within-person improvements reported in the Results cannot be unambiguously attributed to Ash use. Alternative explanations such as regression to the mean (noted by below-average baseline life satisfaction) or unmeasured external events remain plausible and are not ruled out by the paired t-tests.
- [Results] The ANCOVA analyses in the Results section show that engagement metrics predict outcomes with small partial R² values (0.58-2.15%). While statistically significant, these effect sizes are modest, and the observational nature of the data (controlling only for baseline) does not establish that increased engagement with Ash causes the observed improvements in functioning.
- [Methods] The study reports results only for the 1,284 participants who completed both baseline and Week 4 measures, but does not provide information on overall attrition rates, characteristics of dropouts, or any analysis to assess potential bias from selective completion by those who improved.
minor comments (1)
- [Abstract] The abstract could more explicitly state the single-arm nature of the study in the Methods summary to set appropriate expectations for readers.
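The selective-completion concern raised in the third major comment can be sketched with a toy simulation: true change averages zero, but users who happened to improve are more likely to fill in the Week 4 questionnaire. The logistic completion model and its `slope` parameter are hypothetical assumptions chosen for illustration; nothing here is derived from the app's actual retention data.

```python
import math
import random
import statistics

random.seed(1)

def completer_bias(n_users=20000, slope=1.0):
    """Return (mean change among all users, mean change among completers)."""
    all_changes, completer_changes = [], []
    for _ in range(n_users):
        change = random.gauss(0.0, 1.0)  # true average change is zero
        # Assumed model: probability of completing rises with improvement.
        p_complete = 1.0 / (1.0 + math.exp(-slope * change))
        all_changes.append(change)
        if random.random() < p_complete:
            completer_changes.append(change)
    return statistics.mean(all_changes), statistics.mean(completer_changes)

mean_all, mean_completers = completer_bias()
```

In this sketch the completer-only mean is clearly positive while the full-cohort mean stays near zero, which is why attrition rates and dropout characteristics matter for interpreting a completers-only analysis.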
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript to more accurately reflect the preliminary and observational nature of the findings.
- Referee: [Methods] The single-arm observational design without a control condition or randomization (as described in the Methods section) means that the significant within-person improvements reported in the Results cannot be unambiguously attributed to Ash use. Alternative explanations such as regression to the mean (noted by below-average baseline life satisfaction) or unmeasured external events remain plausible and are not ruled out by the paired t-tests.
  Authors: We agree that the single-arm design precludes unambiguous causal attribution. In the revised manuscript we have expanded the Discussion to explicitly list regression to the mean and unmeasured external events as plausible alternative explanations that cannot be ruled out. We have also revised the Abstract, Results, and Conclusion to describe the within-person changes as preliminary associations observed among users rather than effects caused by Ash. The engagement dose-response findings remain correlational only.
  Revision: yes
- Referee: [Results] The ANCOVA analyses in the Results section show that engagement metrics predict outcomes with small partial R² values (0.58-2.15%). While statistically significant, these effect sizes are modest, and the observational nature of the data (controlling only for baseline) does not establish that increased engagement with Ash causes the observed improvements in functioning.
  Authors: We accept that the partial R² values are small and that the observational ANCOVAs do not establish causation. The revised Results section now highlights the modest effect sizes and their limited explanatory power. The Discussion has been updated to state that these associations are consistent with but do not demonstrate a causal role for engagement. We retain the view that small naturalistic associations can still be informative for future controlled work.
  Revision: yes
- Referee: [Methods] The study reports results only for the 1,284 participants who completed both baseline and Week 4 measures, but does not provide information on overall attrition rates, characteristics of dropouts, or any analysis to assess potential bias from selective completion by those who improved.
  Authors: We agree this information is needed to evaluate selection bias. The original submission did not include attrition metrics because the dataset was limited to completers. We have added a Limitations paragraph acknowledging the absence of dropout analysis and the possibility of bias from selective retention. Detailed dropout characteristics could not be recovered from the available app logs.
  Revision: partial
- The single-arm observational design prevents definitive causal claims about Ash; this limitation cannot be resolved without a new controlled study.
Circularity Check
No significant circularity; purely observational empirical analysis
The paper is a single-arm observational cohort study reporting within-person changes via paired t-tests and engagement associations via ANCOVAs on in-app survey responses. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the methods or results. All reported statistics (ps, ds, partial R²) are direct computations from the collected data without any reduction of outputs to inputs by construction. The analysis is self-contained against external benchmarks and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Paired t-test and ANCOVA assumptions (approximate normality of differences, homogeneity of regression slopes) hold for the single-item measures used.
- domain assumption: Single-item self-report measures validly capture the intended constructs (life satisfaction, grandiosity, etc.).