Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Abeer Badawi; Elham Dolatabadi; Frank Rudzicz; Laleh Seyyed-Kalantari; Moyosoreoluwa Olatosi; Negin Baghbanzadeh; R. Shayna Rosenbaum; Sara Pishdadian

arxiv: 2606.18129 · v1 · pith:YDNZOXWKnew · submitted 2026-06-16 · 💻 cs.HC · cs.AI

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Pith reviewed 2026-06-26 22:36 UTC · model grok-4.3

keywords: cognitive atrophy · LLM evaluation · mental health support · conversational benchmark · AI behaviour · clinical schema · response patterns

The pith

LLMs display moderate-to-high cognitive atrophy in mental-health conversations, now measurable by a 20-attribute clinical schema.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines cognitive atrophy as a behavioural pattern where LLMs in counseling-style interactions supply directive advice, solutions, and validation instead of supporting user reflection and independent coping. It constructs a benchmark from 1,576 real human counseling conversations and applies an expert schema to rate over 42,000 LLM responses for atrophy risk. A reader would care because standard safety and helpfulness tests overlook this dynamic that could erode user autonomy over repeated exchanges. Results show the pattern holds across models in both single-turn and multi-turn settings, with weaker adaptation when users ask for decisions rather than safety checks.

Core claim

Cognitive atrophy is formalized as a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. The Cognitive Atrophy Bench is built from 1,576 fully human-generated counseling conversations and 15,680 turns, with three clinical experts creating a 20-attribute schema that six reviewers apply to produce 5,324 judgments on 42,230 LLM responses. Across five LLMs the data show consistent moderate-to-high atrophy-aligned behaviour, stronger responses to overt safety cues than to solution or decision requests, and recurring patterns of directive advice, problem-solving, recommendations, topic shifts, and validation that may reinforce dependence rather than reflection.

What carries the argument

The 20-attribute schema spanning user context, response behaviour, and global risk flags, used to compute the User-Input Risk Index and Cognitive Atrophy Risk Index from reviewer judgments.

Load-bearing premise

The 20-attribute schema developed by clinical experts accurately captures a distinct process called cognitive atrophy that is separate from safety and helpfulness.

What would settle it

A controlled study in which users exposed to high-atrophy versus low-atrophy responses show no measurable difference in subsequent independent reflection or decision-making would falsify the schema's validity as a distinct measure.

Figures

Figures reproduced from arXiv: 2606.18129 by Abeer Badawi, Elham Dolatabadi, Frank Rudzicz, Laleh Seyyed-Kalantari, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, R. Shayna Rosenbaum, Sara Pishdadian.

**Figure 1.** Overview of the COGNITIVE ATROPHY BENCH annotation pipeline, including user-context scoring, response-behaviour evaluation, binary risk flags, and span-grounded evidence.

… vulnerable users [4, 5, 6]. These cases expose a critical evaluation gap: benchmark scores alone do not reveal how models behave. We argue that the central risk is not only whether a response is unsafe, but whether repeated interactions r…

**Figure 2.** The behavioural attributes used in COGNITIVE ATROPHY BENCH. User-context attributes (U) characterize the clinical demands of the input message; response-behaviour attributes (R) characterize observable LLM response patterns; binary flags (F) capture global risk events.

… is actively deployed in consumer-facing products that people already use for mental health support online, ensuring that the findings bear…

**Figure 3.** User Input Risk Index (UIRI) band (Low < 0.30, Medium ∈ [0.30, 0.60), High ≥ 0.60) [54].

… with "Low" remaining minimal as shown in …
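The banding stated in the Figure 3 caption can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the function name `uiri_band` and the demo scores are hypothetical, and only the three thresholds (0.30 and 0.60, with half-open intervals) come from the caption.

```python
from collections import Counter


def uiri_band(score: float) -> str:
    """Map a UIRI score to the band named in the Figure 3 caption.

    Thresholds from the caption: Low < 0.30, Medium in [0.30, 0.60), High >= 0.60.
    """
    if score < 0.30:
        return "Low"
    if score < 0.60:
        return "Medium"
    return "High"


# Made-up scores, used only to show how band counts could be tallied.
demo_scores = [0.12, 0.30, 0.45, 0.60, 0.91]
band_counts = Counter(uiri_band(s) for s in demo_scores)
print(band_counts)
```

The boundaries are deliberately half-open so that a score of exactly 0.30 falls in Medium and exactly 0.60 falls in High, matching the interval notation in the caption.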

**Figure 4.** Top-4 strongest input–response correlations ranked by …

**Figure 5.** Per-attribute change from the opening turn, …

**Figure 6.** Per-cluster spider charts of spans highlight across LLMs. Each panel groups the highlight …

**Figure 7.** Distribution of the 22 mental-health topic labels in …

**Figure 8.** Complete span-grounded annotation example for a single-turn counseling prompt. The …

**Figure 8.** Complete span-grounded annotation example for a single-turn counseling prompt, continued …

**Figure 9.** Per-conversation UIRI(t) with linear-trend overlay. Each subplot is one conversation; line colour indicates slope sign (red = positive, blue = non-positive).

… Benjamini–Hochberg FDR control [65], $q_{(k)} = \min\left(\min_{j \ge k} \frac{K\,p_{(j)}}{j},\ 1\right)$, and declare a cell significant when |ρ| ≥ 0.20 AND q < 0.05. Multi-turn analysis adds three correlation scopes: pooled (all 720 turn-units, headline; 250 cells), per-turn ($n_c = 72$ at eac…
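The linear-trend overlay and slope-sign colouring described in the Figure 9 caption can be sketched with an ordinary least-squares fit of UIRI against turn index. This is an assumption-laden illustration: the page does not say how the trend was fitted, so OLS over turns numbered 1..n is a guess, and the names `trend_slope` and `slope_colour` are hypothetical.

```python
def trend_slope(uiri_by_turn):
    """Ordinary least-squares slope of UIRI(t) against turn index t = 1..n.

    Assumption: the 'linear trend' is a plain OLS fit; the source does not
    state the fitting method.
    """
    n = len(uiri_by_turn)
    if n < 2:
        raise ValueError("need at least two turns to fit a trend")
    turns = range(1, n + 1)
    mean_t = (n + 1) / 2
    mean_u = sum(uiri_by_turn) / n
    sxx = sum((t - mean_t) ** 2 for t in turns)
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(turns, uiri_by_turn))
    return sxy / sxx


def slope_colour(slope):
    """Colour rule from the caption: red = positive, blue = non-positive."""
    return "red" if slope > 0 else "blue"


# Made-up trajectory for one conversation, for illustration only.
example = [0.2, 0.4, 0.6]
print(trend_slope(example), slope_colour(trend_slope(example)))
```

A slope of exactly zero is coloured blue here, which follows the caption's "non-positive" wording.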

**Figure 10.** Per-model Spearman ρ between user-input attributes (U1–U5, rows) and LLM response attributes (R1–R10, columns) for each of the five evaluated LLMs. Outlined cells: |ρ| ≥ 0.20 AND BH-FDR q < 0.05. Per-model raw correlation matrices …
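The outlining rule in the Figure 10 caption combines a Spearman correlation with a Benjamini–Hochberg adjusted q-value, q(k) = min(min over j ≥ k of K·p(j)/j, 1). A rough standard-library sketch of both pieces follows. It is not the authors' pipeline: the function names are hypothetical, and the p-values are assumed to be supplied by a separate routine (for example `scipy.stats.spearmanr`, which returns both the correlation and its p-value).

```python
from math import sqrt


def average_ranks(xs):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    q = [0.0] * K
    running_min = 1.0  # the cap at 1 from the formula
    for rank in range(K, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, K * pvals[idx] / rank)
        q[idx] = running_min
    return q


def is_outlined(rho, q, rho_min=0.20, q_max=0.05):
    """Caption rule: |rho| >= 0.20 AND BH-FDR q < 0.05."""
    return abs(rho) >= rho_min and q < q_max
```

Requiring both conditions means a cell with a tiny q-value but a weak correlation is still left unmarked, which keeps statistically detectable but practically small associations out of the highlighted set.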

**Figure 11.** Per-attribute change from the opening turn, …

Abstract

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper defines cognitive atrophy as a new process-level risk in LLM mental-health chats and ships a 20-attribute expert schema plus two indices, but provides no evidence that the construct separates from standard safety or helpfulness measures.

The main thing to know is that this work tries to add a time-based behavioral dimension to LLM evaluation in sensitive conversations, but the central claim rests on an untested assumption that the new schema captures something distinct.

What is actually new is the formalization of cognitive atrophy, the UIRI and ARI indices, and the use of 1,576 real human counseling conversations as the base data. The authors enlisted three clinical experts to build a 20-attribute schema covering user context, response patterns, and risk flags, then had six reviewers produce over 5,000 judgments on model outputs from five LLMs. That scale, together with the grounding in actual transcripts, is a step beyond most static benchmarks.

The soft spot is the lack of any demonstration that the attributes are separable from helpfulness or safety. Directive advice, problem-solving, and validation responses overlap heavily with existing constructs, yet the paper reports no factor analysis, no correlations with prior benchmarks, and no unique variance checks. Inter-rater reliability numbers and controls for selection bias in the conversation set are also missing from the description. Without those, the moderate-to-high atrophy levels across models remain hard to interpret as a distinct signal.

This paper is for researchers building evaluation tools for therapeutic or high-stakes LLM use. A reader already working on AI safety in mental health could extract the schema and indices as a starting point for further testing. It deserves a serious referee because the underlying problem is real and the data collection effort is substantive, even if the current evidence for distinctness needs strengthening before the claims can land.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces 'cognitive atrophy' as a distinct process-level behavioral measure in LLM mental-health support interactions, separate from safety and helpfulness. It presents COGNITIVE ATROPHY BENCH built from 1,576 human-generated counseling conversations (15,680 turns), a 20-attribute schema developed by three clinical/neuropsychology experts and applied by six reviewers to yield 5,324 judgments, plus new indices (UIRI, ARI) and trajectory summaries. Across five LLMs the work reports consistent moderate-to-high atrophy-aligned behavior, with dominant patterns of directive advice, problem-solving, recommendations, topic shifts, and validation that may reinforce dependence.

Significance. If the schema is shown to capture distinct variance, the work supplies a clinically grounded benchmark and auditing framework for LLM behavior in sensitive, multi-turn settings that existing safety or helpfulness metrics miss. The scale of expert-driven annotation on fully human data and the introduction of process-level indices are concrete strengths that could support future model evaluation and design.

major comments (3)

[Schema development section] The claim that the 20-attribute schema measures a process distinct from safety and helpfulness (Abstract; schema description) rests on expert development and 5,324 judgments but supplies no factor analysis, correlations with existing safety/helpfulness benchmarks, or unique-variance evidence. This distinctness is load-bearing for the central formalization of cognitive atrophy.
[Annotation and results sections] No inter-rater reliability statistics (e.g., Cohen’s κ or ICC) or controls for selection bias in the 1,576 conversations are reported, yet the paper asserts moderate-to-high atrophy levels across models and settings (Abstract; results).
[Results reporting] The reported moderate-to-high atrophy-aligned behavior lacks accompanying statistical tests, confidence intervals, or effect-size measures that would substantiate the cross-model and single/multi-turn consistency claims.
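The inter-rater reliability statistic named in the second major comment, Cohen's κ, can be computed for one pair of reviewers as sketched below. This is only an illustration of the statistic the referee asks for: the labels are made up, the function name is hypothetical, and nothing here reflects reliability values from the paper, which the page says are not reported. With six reviewers, κ would have to be computed pairwise, or replaced by a multi-rater statistic such as Fleiss' κ or Krippendorff's α.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up binary flag judgments from two hypothetical reviewers.
reviewer_1 = [1, 1, 0, 0]
reviewer_2 = [1, 0, 0, 0]
print(cohens_kappa(reviewer_1, reviewer_2))
```

This unweighted form treats all disagreements equally; for ordinal attribute ratings a weighted κ or an ICC, as the referee also suggests, may be the better fit.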

minor comments (2)

[Index definitions] Explicit formulas or pseudocode for UIRI and ARI would improve reproducibility.
[Tables and figures] Figure captions and table legends could more clearly indicate which attributes map to user context, response behaviour, and global risk flags.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the manuscript. We address each major comment point-by-point below, indicating where revisions will be made.

Referee: The claim that the 20-attribute schema measures a process distinct from safety and helpfulness (Abstract; schema description) rests on expert development and 5,324 judgments but supplies no factor analysis, correlations with existing safety/helpfulness benchmarks, or unique-variance evidence. This distinctness is load-bearing for the central formalization of cognitive atrophy.

Authors: The schema was developed by three clinical and neuropsychology experts specifically to capture process-level behaviors in mental-health support interactions that standard safety or helpfulness metrics do not address. We agree that empirical validation, such as factor analysis or correlations with existing benchmarks, would strengthen the claim of distinctness; the current work relies on expert construction and the volume of judgments. In revision, we will include an analysis of correlations with available model safety scores where applicable and add a dedicated limitations subsection discussing the need for further validation of distinct variance. This addresses the concern without altering the core contribution.

Revision: partial

Referee: No inter-rater reliability statistics (e.g., Cohen’s κ or ICC) or controls for selection bias in the 1,576 conversations are reported, yet the paper asserts moderate-to-high atrophy levels across models and settings (Abstract; results).

Authors: We acknowledge the omission of inter-rater reliability statistics. The six reviewers were trained clinical professionals applying the schema with span-grounded evidence. We will compute and report appropriate reliability metrics such as Cohen’s κ in the revised annotation section. For selection bias, the 1,576 conversations were drawn from established public counseling conversation datasets; we will expand the methods to detail the sampling strategy and any stratification used. These additions will be included in the revision.

Revision: yes

Referee: The reported moderate-to-high atrophy-aligned behavior lacks accompanying statistical tests, confidence intervals, or effect-size measures that would substantiate the cross-model and single/multi-turn consistency claims.

Authors: We agree that the results would benefit from additional statistical rigor. In the revised results section, we will incorporate statistical tests for model comparisons, confidence intervals around the reported atrophy levels, and effect size measures to support the claims of consistency across models and settings. This will provide a more robust substantiation of the findings.

Revision: yes
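One common way to attach confidence intervals to a mean risk score, of the kind promised in this response, is a percentile bootstrap. The sketch below is a generic illustration rather than the authors' planned analysis: the function name, the default of 2,000 resamples, and the demo scores are all assumptions.

```python
import random


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Resamples the scores with replacement, records each resample mean, and
    returns the empirical alpha/2 and 1 - alpha/2 quantiles of those means.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-response risk scores for one hypothetical model.
demo = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.66, 0.58]
print(bootstrap_mean_ci(demo))
```

Because responses from the same conversation are unlikely to be independent, a real analysis would probably resample whole conversations rather than individual responses; that clustering step is omitted here for brevity.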

Circularity Check

0 steps flagged

No circularity: new indices constructed from expert schema without reduction to fitted inputs or self-citations

The paper defines cognitive atrophy as a new construct via an expert-developed 20-attribute schema, then applies it to produce UIRI, ARI, and trajectory summaries on LLM outputs. These steps are definitional and measurement-oriented rather than predictive; the indices are computed directly from the schema judgments on the collected conversations and do not reduce by construction to quantities fitted from the same data or to any self-citation chain. No equations, uniqueness theorems, or ansatzes are imported from prior author work to force the central distinction from safety/helpfulness. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the validity of the expert schema and the assumption that the selected human conversations represent realistic mental-health interactions; no free parameters are described, but the new concept and indices constitute invented entities without external falsifiable handles mentioned.

axioms (1)

domain assumption: The 20-attribute schema developed by three experts accurately measures cognitive atrophy as distinct from safety and helpfulness.
Invoked when the schema is used to generate the 5,324 reviewer judgments that support the reported atrophy levels.

invented entities (3)

Cognitive Atrophy · no independent evidence
purpose: Process-level behavioural measure for reduction in user reflection and decision-making during LLM mental-health interactions
Newly formalized concept to address the evaluation gap described in the abstract.

User-Input Risk Index (UIRI) · no independent evidence
purpose: Quantify risk level in user inputs for atrophy analysis
New index introduced alongside the benchmark.

Cognitive Atrophy Risk Index (ARI) · no independent evidence
purpose: Quantify atrophy risk from model responses
New index introduced alongside the benchmark.

pith-pipeline@v0.9.1-grok · 5858 in / 1590 out tokens · 34336 ms · 2026-06-26T22:36:39.454611+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 1 linked inside Pith

[1] Aditya N. Vaidyam, Hannah Wisniewski, John D. Halamka, Matcheri S. Keshavan, and John B. Torous. Chatbots and conversational agents in mental health: A review of the psychiatric landscape. Canadian Journal of Psychiatry, 64(7):456–464, 2019.
[2] Liliana Laranjo, Adam G. Dunn, Huong Ly Tong, et al. Conversational agents in healthcare: A systematic review. Journal of Medical Internet Research, 20(5):e124, 2018.
[3] Abeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang, Shaina Raza, and Elham Dolatabadi. Position: Beyond assistance–reimagining llms as ethical and adaptive co-creators in mental health care. arXiv preprint arXiv:2503.16456, 2025.
[4] AI Incident Database. Incident 1192: 16-year-old allegedly received suicide-related guidance from chatgpt, 2025.
[5] AI Incident Database. Incident report: Openai chatgpt and suicide-related harms (incident id: 1106), 2025.
[6] A. Hudon and E. Stip. Delusional experiences emerging from ai chatbot interactions or “ai psychosis”. JMIR Mental Health, 12:e85799, 2025.
[7] Evan F. Risko and Sam J. Gilbert. Cognitive offloading. Trends in Cognitive Sciences, 20(9):676–688, 2016.
[8] Betsy Sparrow, Jenny Liu, and Daniel M. Wegner. Google effects on memory: Cognitive consequences of having information at our fingertips. Science, 333(6043):776–778, 2011.
[9] Sandra Grinschgl, Frank Papenmeier, and Hauke S. Meyerhoff. Consequences of cognitive offloading: Boosting performance but diminishing memory. Quarterly Journal of Experimental Psychology, 2021.
[10] David J. Wood, Jerome S. Bruner, and Gail Ross. The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2):89–100, 1976.
[11] William R. Miller and Stephen Rollnick. Motivational Interviewing: Helping People Change. Guilford Press, New York, NY, 3 edition, 2013.
[12] Haoan Jin, Siyuan Chen, Mengyue Wu, and Kenny Q. Zhu. PsyEval: A suite of mental health related tasks for evaluating large language models. arXiv preprint arXiv:2311.09189, 2023.
[13] Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, and Nirmal Punjabi. Mhqa: A diverse, knowledge intensive mental health question answering challenge for language models. arXiv preprint arXiv:2502.15418, 2025.
[14] Junlei Zhang, Hongliang He, Lizhi Ma, Nirui Song, Shuyuan He, Shuai Zhang, Huachuan Qiu, Zhanchao Zhou, Anqi Li, Yong Dai, et al. Conceptpsy: A comprehensive benchmark suite for hierarchical psychological concept understanding in llms. Neurocomputing, 637:130070, 2025.
[15] Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, et al. Do large language models align with core mental health counseling competencies? arXiv preprint arXiv:2410.22446, 2024.
[16] Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, and Ruishan Liu. Counselbench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling. arXiv preprint arXiv:2506.08584, 2025.
[17] Sewon Kim, Jiwon Kim, Seungwoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, and Hyunsoo Yoon. Being kind isn’t always being safe: Diagnosing affective hallucination in llms. arXiv preprint arXiv:2508.16921, 2026.
[18] Do June Min, Verónica Pérez-Rosas, Kenneth Resnicow, and Rada Mihalcea. Pair: Prompt-aware margin ranking for counselor reflection scoring in motivational interviewing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 148–158, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics.
[19] Ganeshan Malhotra, Abdul Waheed, Ashutosh Srivastava, Md Shad Akhtar, and Tanmoy Chakraborty. Speaker and time-aware joint contextual learning for dialogue-act classification in counselling conversations. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, pages 735–745. Association for Computing Machinery, 2022.
[20] Zixiu Wu, Simone Balloccu, Vivek Kumar, Rim Helaoui, Ehud Reiter, Diego Reforgiato Recupero, and Daniele Riboni. Anno-mi: A dataset of expert-annotated counselling dialogues. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6177–6181, 2022.
[21] Antonio Pascual-Leone and Nadiia Yeryomenko. The client “experiencing” scale as a predictor of treatment outcomes: A meta-analysis on psychotherapy process. Psychotherapy Research, 27(6):653–665, 2017.
[22] James M. Bolton, David Gunnell, and Gustavo Turecki. Suicide risk assessment and intervention in people with mental illness. BMJ, 351, 2015.
[23] Surjodeep Sarkar, Manas Gaur, Lujie Karen Chen, Muskan Garg, and Biplav Srivastava. A review of the explainability and safety of conversational agents for mental health to identify avenues for improvement. Frontiers in Artificial Intelligence, 6:1229805, 2023.
[24] Zainab Iftikhar, Annie Xiao, Sarah Ransom, Jeff Huang, and Harini Suresh. How llm counselors violate ethical standards in mental health practice. In AAAI/ACM Conference on AI, Ethics, and Society, 2025.
[25] Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber. Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, pages 1743–1757. Association for Computing M...
[26] Till Scholich, Maya Barr, Shannon Wiltsey Stirman, and Shriti Raj. A comparison of responses from human therapists and large language model–based chatbots to assess therapeutic communication: Mixed methods study. JMIR Mental Health, 12:e69709, 2025.
[27] Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust llms in mental health? large-scale benchmarks for reliable llm evaluation. arXiv preprint arXiv:2510.19032, 2025.
[28] Abeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi, Sheri Grach, Lindsay Bertrand, Lames Danok, Frank Rudzicz, Jimmy Huang, and Elham Dolatabadi. Assessing the quality of mental health support in llm responses through multi-attribute human evaluation. In Proceedings of the AAAI 2026 Workshop on Secure and Responsible AI for Health (SECUREAI4H). Associatio...
[29] Donald Horton and R. Richard Wohl. Mass communication and para-social interaction: Observations on intimacy at a distance. Psychiatry, 19(3):215–229, 1956.
[30] Petter Bae Brandtzaeg, Marita Skjuve, and Asbjørn Følstad. My ai friend: How users of a social chatbot understand their human–ai friendship. Human Communication Research, 48(3):404–429, 2022.
[31] Tianling Xie and Iryna Pentina. Attachment theory as a framework to understand relationships with social chatbots: A case study of replika. In Proceedings of the 55th Hawaii International Conference on System Sciences, 2022.
[32] Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, and Zhiyu Zoey Chen. Cbt-bench: Evaluating large language models on assisting cognitive behavior therapy. arXiv preprint arXiv:2410.13218, 2024.
[33] Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, and Mingming Fan. Exploring safety alignment evaluation of llms in chinese mental health dialogues via llm-as-judge. arXiv preprint arXiv:2508.08236, 2025.
[34] Chenhao Zhang, Renhao Li, Minghuan Tan, Min Yang, Jingwei Zhu, Di Yang, Jiahao Zhao, Guancheng Ye, Chengming Li, and Xiping Hu. Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling. arXiv preprint arXiv:2405.16433, 2024.
[35] Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, and Li Shen. Mentalchat16k: A benchmark dataset for conversational mental health assistance. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) V.2, pages 5367–5378, 2025.
[36] Zhengqing Yuan, Liang Wu, Jian Xu, Zheyuan Zhang, Kaiwen Shi, Weixiang Sun, Lichao Sun, and Yanfang Ye. Can llms move beyond short exchanges to realistic therapy conversations? In The Fourteenth International Conference on Learning Representations, 2026.
[37] C. Konnoth. AI and data protection law in health. In B. Solaiman and I. G. Cohen, editors, Research Handbook on Health, AI and the Law. Edward Elgar Publishing, Cheltenham, UK, 2024.
[38] Nicolas Bertagnolli. Counsel chat: Bootstrapping high-quality therapy data. https://huggingface.co/datasets/nbertagnolli/counsel-chat, 2020. Hugging Face dataset.
[39] G. Bischof, A. Bischof, and H.-J. Rumpf. Motivational interviewing: An evidence-based approach for use in medical practice. Deutsches Ärzteblatt International, 118:109–115, 2021.
[40] Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care. npj Digital Medicine, 8(1):230, 2025.
[41] OpenAI. GPT-4o system card. https://openai.com/index/gpt-4o-system-card, 2025.
[42] Anthropic. Claude sonnet 4 model card. https://www.anthropic.com/claude/sonnet, 2025.
[43] Google DeepMind. Gemini 2.5 flash. https://deepmind.google/technologies/gemini/flash, 2025.
[44] Meta AI. Llama 4: The next generation of meta’s open foundation models. https://ai.meta.com/blog/llama-4-multimodal-intelligence, 2025.
[45] Qwen Team. Qwen3 technical report. https://qwenlm.github.io/blog/qwen3, 2025.
[46] Marsha M. Linehan. Dialectical behavior therapy for treatment of borderline personality disorder: Implications for the treatment of substance abuse. NIDA Research Monograph, 137:201, 1993.
[47] Pim Cuijpers, Mirjam Reijnders, and Marcus J. H. Huibers. The role of common factors in psychotherapy outcomes. Annual Review of Clinical Psychology, 15(1):207–231, 2019.
[48] Bruce E. Wampold and Zac E. Imel. The Great Psychotherapy Debate: The Evidence for What Makes Psychotherapy Work. Routledge, 2015.
[49] Henny A. Westra and Nahal Norouzian. Using motivational interviewing to manage process markers of ambivalence and resistance in cognitive behavioral therapy. Cognitive Therapy and Research, 42(2):193–203, 2018.
[50] Anja Stukenbrock, Arnulf Deppermann, and Carl Eduard Scheidt. The art of tentativity: Delivering interpretations in psychodynamic psychotherapy. Journal of Pragmatics, 176:76–96, 2021.
[51] Robert Elliott, Arthur C. Bohart, Jeanne C. Watson, and David Murphy. Therapist empathy and client outcome: An updated meta-analysis. Psychotherapy, 55(4):399, 2018.
[52] Jessica L. Borelli, Leah Sohn, Beverly A. Wang, Kathy Hong, Cassandra DeCoste, and Nancy E. Suchman. Therapist–client language matching: Initial promise as a measure of therapist–client relationship quality. Psychoanalytic Psychology, 36(1):9, 2019.
[53] John C. Norcross and Michael J. Lambert. Psychotherapy relationships that work III. Psychotherapy, 55(4):303, 2018.
[54] G. O. Boateng, T. B. Neilands, E. A. Frongillo, H. R. Melgar-Quinonez, and S. L. Young. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Frontiers in Public Health, 6:149, 2018.
[55] C. Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15(1):72–101, 1904.
[56] P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[57] M. G. Kendall and A. Stuart. The Advanced Theory of Statistics, Vol. 2: Inference and Relationship. Charles Griffin, 3rd edition, 1973.
[58] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 2nd edition, 1988.
[59] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33:159–174, 1977.
[60] William R. Miller et al. Manual for the motivational interviewing skill code (misc), version 2. University of New Mexico, 2003.
[61] Theresa B. Moyers et al. Motivational interviewing treatment integrity coding manual 4.1. University of New Mexico, 2014.
[62] Alvan R. Feinstein and Domenic V. Cicchetti. High agreement but low kappa. Journal of Clinical Epidemiology, 43:543–549, 1990.
[63] Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage, 2004.
[64] B. Efron. Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.
[65] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

A Ethics, Data, and Release

The human-evaluation protocol was reviewed and approved by the authors’ Institutional Review Board (details withhel...
