pith. sign in

arxiv: 2606.18129 · v1 · pith:YDNZOXWKnew · submitted 2026-06-16 · 💻 cs.HC · cs.AI

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Pith reviewed 2026-06-26 22:36 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords cognitive atrophyLLM evaluationmental health supportconversational benchmarkAI behaviourclinical schemaresponse patterns
0
0 comments X

The pith

LLMs display moderate-to-high cognitive atrophy in mental-health conversations, now measurable by a 20-attribute clinical schema.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines cognitive atrophy as a behavioural pattern where LLMs in counseling-style interactions supply directive advice, solutions, and validation instead of supporting user reflection and independent coping. It constructs a benchmark from 1,576 real human counseling conversations and applies an expert schema to rate over 42,000 LLM responses for atrophy risk. A reader would care because standard safety and helpfulness tests overlook this dynamic that could erode user autonomy over repeated exchanges. Results show the pattern holds across models in both single-turn and multi-turn settings, with weaker adaptation when users ask for decisions rather than safety checks.

Core claim

Cognitive atrophy is formalized as a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. The Cognitive Atrophy Bench is built from 1,576 fully human-generated counseling conversations and 15,680 turns, with three clinical experts creating a 20-attribute schema that six reviewers apply to produce 5,324 judgments on 42,230 LLM responses. Across five LLMs the data show consistent moderate-to-high atrophy-aligned behaviour, stronger responses to overt safety cues than to solution or decision requests, and recurring patterns of directive advice, problem-solving, recommendations, topic shifts, and validation that may reinforce dependence rat

What carries the argument

The 20-attribute schema spanning user context, response behaviour, and global risk flags, used to compute the User-Input Risk Index and Cognitive Atrophy Risk Index from reviewer judgments.

Load-bearing premise

The 20-attribute schema developed by clinical experts accurately captures a distinct process called cognitive atrophy that is separate from safety and helpfulness.

What would settle it

A controlled study in which users exposed to high-atrophy versus low-atrophy responses show no measurable difference in subsequent independent reflection or decision-making would falsify the schema's validity as a distinct measure.

Figures

Figures reproduced from arXiv: 2606.18129 by Abeer Badawi, Elham Dolatabadi, Frank Rudzicz, Laleh Seyyed-Kalantari, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, R. Shayna Rosenbaum, Sara Pishdadian.

Figure 1
Figure 1. Figure 1: Overview of the COGNITIVE ATROPHY BENCH annotation pipeline, including user-context scoring, response-behaviour evaluation, binary risk flags, and span-grounded evidence. vulnerable users [4, 5, 6]. These cases expose a critical evaluation gap: benchmark scores alone do not reveal how models behave. We argue that the central risk is not only whether a response is unsafe, but whether repeated interactions r… view at source ↗
Figure 2
Figure 2. Figure 2: The behavioural attributes used in COGNITIVE ATROPHY BENCH. User-context attributes (U) characterize the clinical demands of the input message; response-behaviour attributes (R) charac￾terize observable LLM response patterns; binary flags (F) capture global risk events. is actively deployed in consumer-facing products that people already use for mental health support online, ensuring that the findings bear… view at source ↗
Figure 3
Figure 3. Figure 3: User Input Risk Index (UIRI) band (Low < 0.30, Medium ∈ [0.30, 0.60), High ≥ 0.60) [54]. with "Low" remaining minimal as shown in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Top-4 strongest input–response correlations ranked by [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-attribute change from the opening turn, [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-cluster spyder charts of spans highlight across LLMs. Each panel groups the highlight [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of the 22 mental-health topic labels in [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Complete span-grounded annotation example for a single-turn counseling prompt. The [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 8
Figure 8. Figure 8: Complete span-grounded annotation example for a single-turn counseling prompt, contin [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Per-conversation UIRI(t) with linear-trend overlay. Each subplot is one conversation; line colour indicates slope sign (red = positive, blue = non-positive). Benjamini–Hochberg FDR control [65], q(k)= min minj≥k(Kp(j)/j), 1  , and declare a cell signifi￾cant when |ρ|≥0.20 AND q<0.05. Multi-turn analysis adds three correlation scopes: pooled (all 720 turn-units, headline; 250 cells), per-turn (nc=72 at eac… view at source ↗
Figure 10
Figure 10. Figure 10: Per-model Spearman ρ between user-input attributes (U1–U5, rows) and LLM response attributes (R1–R10, columns) for each of the five evaluated LLMs. Outlined cells: |ρ|≥0.20 AND BH-FDR q<0.05. Per-model raw correlation matrices [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-attribute change from the opening turn, [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗
read the original abstract

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces 'cognitive atrophy' as a distinct process-level behavioral measure in LLM mental-health support interactions, separate from safety and helpfulness. It presents COGNITIVE ATROPHY BENCH built from 1,576 human-generated counseling conversations (15,680 turns), a 20-attribute schema developed by three clinical/neuropsychology experts and applied by six reviewers to yield 5,324 judgments, plus new indices (UIRI, ARI) and trajectory summaries. Across five LLMs the work reports consistent moderate-to-high atrophy-aligned behavior, with dominant patterns of directive advice, problem-solving, recommendations, topic shifts, and validation that may reinforce dependence.

Significance. If the schema is shown to capture distinct variance, the work supplies a clinically grounded benchmark and auditing framework for LLM behavior in sensitive, multi-turn settings that existing safety or helpfulness metrics miss. The scale of expert-driven annotation on fully human data and the introduction of process-level indices are concrete strengths that could support future model evaluation and design.

major comments (3)
  1. [Schema development section] The claim that the 20-attribute schema measures a process distinct from safety and helpfulness (Abstract; schema description) rests on expert development and 5,324 judgments but supplies no factor analysis, correlations with existing safety/helpfulness benchmarks, or unique-variance evidence. This distinctness is load-bearing for the central formalization of cognitive atrophy.
  2. [Annotation and results sections] No inter-rater reliability statistics (e.g., Cohen’s κ or ICC) or controls for selection bias in the 1,576 conversations are reported, yet the paper asserts moderate-to-high atrophy levels across models and settings (Abstract; results).
  3. [Results reporting] The reported moderate-to-high atrophy-aligned behavior lacks accompanying statistical tests, confidence intervals, or effect-size measures that would substantiate the cross-model and single/multi-turn consistency claims.
minor comments (2)
  1. [Index definitions] Explicit formulas or pseudocode for UIRI and ARI would improve reproducibility.
  2. [Tables and figures] Figure captions and table legends could more clearly indicate which attributes map to user context, response behaviour, and global risk flags.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the manuscript. We address each major comment point-by-point below, indicating where revisions will be made.

read point-by-point responses
  1. Referee: The claim that the 20-attribute schema measures a process distinct from safety and helpfulness (Abstract; schema description) rests on expert development and 5,324 judgments but supplies no factor analysis, correlations with existing safety/helpfulness benchmarks, or unique-variance evidence. This distinctness is load-bearing for the central formalization of cognitive atrophy.

    Authors: The schema was developed by three experts in clinical and neuropsychology specifically to capture process-level behaviors in mental-health support interactions that are not addressed by standard safety or helpfulness metrics. While we agree that empirical validation such as factor analysis or correlations would strengthen the claim of distinctness, the current work relies on the expert construction and the volume of judgments. In revision, we will include an analysis of correlations with available model safety scores where applicable and add a dedicated limitations subsection discussing the need for further validation of distinct variance. This addresses the concern without altering the core contribution. revision: partial

  2. Referee: No inter-rater reliability statistics (e.g., Cohen’s κ or ICC) or controls for selection bias in the 1,576 conversations are reported, yet the paper asserts moderate-to-high atrophy levels across models and settings (Abstract; results).

    Authors: We acknowledge the omission of inter-rater reliability statistics. The six reviewers were trained clinical professionals applying the schema with span-grounded evidence. We will compute and report appropriate reliability metrics such as Cohen’s κ in the revised annotation section. For selection bias, the 1,576 conversations were drawn from established public counseling conversation datasets; we will expand the methods to detail the sampling strategy and any stratification used. These additions will be included in the revision. revision: yes

  3. Referee: The reported moderate-to-high atrophy-aligned behavior lacks accompanying statistical tests, confidence intervals, or effect-size measures that would substantiate the cross-model and single/multi-turn consistency claims.

    Authors: We agree that the results would benefit from additional statistical rigor. In the revised results section, we will incorporate statistical tests for model comparisons, confidence intervals around the reported atrophy levels, and effect size measures to support the claims of consistency across models and settings. This will provide a more robust substantiation of the findings. revision: yes

Circularity Check

0 steps flagged

No circularity: new indices constructed from expert schema without reduction to fitted inputs or self-citations

full rationale

The paper defines cognitive atrophy as a new construct via an expert-developed 20-attribute schema, then applies it to produce UIRI, ARI, and trajectory summaries on LLM outputs. These steps are definitional and measurement-oriented rather than predictive; the indices are computed directly from the schema judgments on the collected conversations and do not reduce by construction to quantities fitted from the same data or to any self-citation chain. No equations, uniqueness theorems, or ansatzes are imported from prior author work to force the central distinction from safety/helpfulness. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim rests on the validity of the expert schema and the assumption that the selected human conversations represent realistic mental-health interactions; no free parameters are described, but the new concept and indices constitute invented entities without external falsifiable handles mentioned.

axioms (1)
  • domain assumption The 20-attribute schema developed by three experts accurately measures cognitive atrophy as distinct from safety and helpfulness
    Invoked when the schema is used to generate the 5,324 reviewer judgments that support the reported atrophy levels.
invented entities (3)
  • Cognitive Atrophy no independent evidence
    purpose: Process-level behavioural measure for reduction in user reflection and decision-making during LLM mental-health interactions
    Newly formalized concept to address the evaluation gap described in the abstract.
  • User-Input Risk Index (UIRI) no independent evidence
    purpose: Quantify risk level in user inputs for atrophy analysis
    New index introduced alongside the benchmark.
  • Cognitive Atrophy Risk Index (ARI) no independent evidence
    purpose: Quantify atrophy risk from model responses
    New index introduced alongside the benchmark.

pith-pipeline@v0.9.1-grok · 5858 in / 1590 out tokens · 34336 ms · 2026-06-26T22:36:39.454611+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 1 linked inside Pith

  1. [1]

    Vaidyam, Hannah Wisniewski, John D

    Aditya N. Vaidyam, Hannah Wisniewski, John D. Halamka, Matcheri S. Kashavan, and John B. Torous. Chatbots and conversational agents in mental health: A review of the psychiatric landscape.Canadian Journal of Psychiatry, 64(7):456–464, 2019

  2. [2]

    Dunn, Huong Ly Tong, et al

    Liliana Laranjo, Adam G. Dunn, Huong Ly Tong, et al. Conversational agents in healthcare: A systematic review.Journal of Medical Internet Research, 20(5):e124, 2018

  3. [3]

    Position: Beyond assistance–reimagining llms as ethical and adaptive co-creators in mental health care

    Abeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang, Shaina Raza, and Elham Dolatabadi. Position: Beyond assistance–reimagining llms as ethical and adaptive co-creators in mental health care. arXiv preprint arXiv:2503.16456, 2025

  4. [4]

    Incident 1192: 16-year-old allegedly received suicide-related guidance from chatgpt, 2025

    AI Incident Database. Incident 1192: 16-year-old allegedly received suicide-related guidance from chatgpt, 2025

  5. [5]

    Incident report: Openai chatgpt and suicide-related harms (incident id: 1106), 2025

    AI Incident Database. Incident report: Openai chatgpt and suicide-related harms (incident id: 1106), 2025

  6. [6]

    ai psychosis

    A. Hudon and E. Stip. Delusional experiences emerging from ai chatbot interactions or “ai psychosis”. JMIR Mental Health, 12:e85799, 2025

  7. [7]

    Risko and Sam J

    Evan F. Risko and Sam J. Gilbert. Cognitive offloading.Trends in Cognitive Sciences, 20(9):676–688, 2016

  8. [8]

    Betsy Sparrow, Jenny Liu, and Daniel M. Wegner. Google effects on memory: Cognitive consequences of having information at our fingertips.Science, 333(6043):776–778, 2011

  9. [9]

    Meyerhoff

    Sandra Grinschgl, Frank Papenmeier, and Hauke S. Meyerhoff. Consequences of cognitive offloading: Boosting performance but diminishing memory.Quarterly Journal of Experimental Psychology, 2021

  10. [10]

    Wood, Jerome S

    David J. Wood, Jerome S. Bruner, and Gail Ross. The role of tutoring in problem solving.Journal of Child Psychology and Psychiatry, 17(2):89–100, 1976

  11. [11]

    Miller and Stephen Rollnick.Motivational Interviewing: Helping People Change

    William R. Miller and Stephen Rollnick.Motivational Interviewing: Helping People Change. Guilford Press, New York, NY , 3 edition, 2013. 10

  12. [12]

    Haoan Jin, Siyuan Chen, Mengyue Wu, and Kenny Q. Zhu. PsyEval: A suite of mental health related tasks for evaluating large language models.arXiv preprint arXiv:2311.09189, 2023

  13. [13]

    Mhqa: A diverse, knowledge intensive mental health question answering challenge for language models.arXiv preprint arXiv:2502.15418, 2025

    Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, and Nirmal Punjabi. Mhqa: A diverse, knowledge intensive mental health question answering challenge for language models.arXiv preprint arXiv:2502.15418, 2025

  14. [14]

    Conceptpsy: A comprehensive benchmark suite for hierarchical psychological concept understanding in llms.Neurocomputing, 637:130070, 2025

    Junlei Zhang, Hongliang He, Lizhi Ma, Nirui Song, Shuyuan He, Shuai Zhang, Huachuan Qiu, Zhan- chao Zhou, Anqi Li, Yong Dai, et al. Conceptpsy: A comprehensive benchmark suite for hierarchical psychological concept understanding in llms.Neurocomputing, 637:130070, 2025

  15. [15]

    Soled, Michael L

    Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, et al. Do large language models align with core mental health counseling competencies?arXiv preprint arXiv:2410.22446, 2024

  16. [16]

    Bunyi, Adam C

    Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, and Ruishan Liu. Counselbench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling.arXiv preprint arXiv:2506.08584, 2025

  17. [17]

    Being kind isn’t always being safe: Diagnosing affective hallucination in llms.arXiv preprint arXiv:2508.16921, 2026

    Sewon Kim, Jiwon Kim, Seungwoo Shin, Hyejin Chung, Daeun Moon, Yejin Kwon, and Hyunsoo Yoon. Being kind isn’t always being safe: Diagnosing affective hallucination in llms.arXiv preprint arXiv:2508.16921, 2026

  18. [18]

    Pair: Prompt-aware margin ranking for counselor reflection scoring in motivational interviewing

    Do June Min, Verónica Pérez-Rosas, Kenneth Resnicow, and Rada Mihalcea. Pair: Prompt-aware margin ranking for counselor reflection scoring in motivational interviewing. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 148–158, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics

  19. [19]

    Speaker and time-aware joint contextual learning for dialogue-act classification in counselling conversations

    Ganeshan Malhotra, Abdul Waheed, Ashutosh Srivastava, Md Shad Akhtar, and Tanmoy Chakraborty. Speaker and time-aware joint contextual learning for dialogue-act classification in counselling conversations. InProceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, pages 735–745. Association for Computing Machinery, 2022

  20. [20]

    Anno-mi: A dataset of expert-annotated counselling dialogues

    Zixiu Wu, Simone Balloccu, Vivek Kumar, Rim Helaoui, Ehud Reiter, Diego Reforgiato Recupero, and Daniele Riboni. Anno-mi: A dataset of expert-annotated counselling dialogues. InProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6177–6181, 2022

  21. [21]

    experiencing

    Antonio Pascual-Leone and Nadiia Yeryomenko. The client “experiencing” scale as a predictor of treatment outcomes: A meta-analysis on psychotherapy process.Psychotherapy Research, 27(6):653–665, 2017

  22. [22]

    Bolton, David Gunnell, and Gustavo Turecki

    James M. Bolton, David Gunnell, and Gustavo Turecki. Suicide risk assessment and intervention in people with mental illness.BMJ, 351, 2015

  23. [23]

    A review of the explainability and safety of conversational agents for mental health to identify avenues for improvement

    Surjodeep Sarkar, Manas Gaur, Lujie Karen Chen, Muskan Garg, and Biplav Srivastava. A review of the explainability and safety of conversational agents for mental health to identify avenues for improvement. Frontiers in Artificial Intelligence, 6:1229805, 2023

  24. [24]

    How llm counselors violate ethical standards in mental health practice

    Zainab Iftikhar, Annie Xiao, Sarah Ransom, Jeff Huang, and Harini Suresh. How llm counselors violate ethical standards in mental health practice. InAAAI/ACM Conference on AI, Ethics, and Society, 2025

  25. [25]

    Ong, and Nick Haber

    Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, and Nick Haber. Expressing stigma and inappropriate responses prevents llms from safely replacing mental health providers. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’25, pages 1743–1757. Association for Computing M...

  26. [26]

    A comparison of responses from human therapists and large language model–based chatbots to assess therapeutic communication: Mixed methods study.JMIR Mental Health, 12:e69709, 2025

    Till Scholich, Maya Barr, Shannon Wiltsey Stirman, and Shriti Raj. A comparison of responses from human therapists and large language model–based chatbots to assess therapeutic communication: Mixed methods study.JMIR Mental Health, 12:e69709, 2025

  27. [27]

    When can we trust llms in mental health? large-scale benchmarks for reliable llm evaluation.arXiv preprint arXiv:2510.19032, 2025

    Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust llms in mental health? large-scale benchmarks for reliable llm evaluation.arXiv preprint arXiv:2510.19032, 2025

  28. [28]

    Assessing the quality of mental health support in llm responses through multi-attribute human evaluation

    Abeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi, Sheri Grach, Lindsay Bertrand, Lames Danok, Frank Rudzicz, Jimmy Huang, and Elham Dolatabadi. Assessing the quality of mental health support in llm responses through multi-attribute human evaluation. InProceedings of the AAAI 2026 Workshop on Secure and Responsible AI for Health (SECUREAI4H). Associatio...

  29. [29]

    Richard Wohl

    Donald Horton and R. Richard Wohl. Mass communication and para-social interaction: Observations on intimacy at a distance.Psychiatry, 19(3):215–229, 1956

  30. [30]

    My ai friend: How users of a social chatbot understand their human–ai friendship.Human Communication Research, 48(3):404–429, 2022

    Petter Bae Brandtzaeg, Marita Skjuve, and Asbjørn Følstad. My ai friend: How users of a social chatbot understand their human–ai friendship.Human Communication Research, 48(3):404–429, 2022

  31. [31]

    Attachment theory as a framework to understand relationships with social chatbots: A case study of replika

    Tianling Xie and Iryna Pentina. Attachment theory as a framework to understand relationships with social chatbots: A case study of replika. InProceedings of the 55th Hawaii International Conference on System Sciences, 2022

  32. [32]

    Chiu, Shaun M

    Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, and Zhiyu Zoey Chen. Cbt-bench: Evaluating large language models on assisting cognitive behavior therapy.arXiv preprint arXiv:2410.13218, 2024. 2024c

  33. [33]

    Exploring safety alignment evaluation of llms in chinese mental health dialogues via llm-as-judge.arXiv preprint arXiv:2508.08236, 2025

    Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, and Mingming Fan. Exploring safety alignment evaluation of llms in chinese mental health dialogues via llm-as-judge.arXiv preprint arXiv:2508.08236, 2025

  34. [34]

    Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling.arXiv preprint arXiv:2405.16433, 2024

    Chenhao Zhang, Renhao Li, Minghuan Tan, Min Yang, Jingwei Zhu, Di Yang, Jiahao Zhao, Guancheng Ye, Chengming Li, and Xiping Hu. Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling.arXiv preprint arXiv:2405.16433, 2024

  35. [35]

    Mentalchat16k: A benchmark dataset for conversational mental health assistance

    Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, and Li Shen. Mentalchat16k: A benchmark dataset for conversational mental health assistance. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) V . 2, pages 5367–5378, 2025

  36. [36]

    Can llms move beyond short exchanges to realistic therapy conversations? InThe Fourteenth International Conference on Learning Representations, 2026

    Zhengqing Yuan, Liang Wu, Jian Xu, Zheyuan Zhang, Kaiwen Shi, Weixiang Sun, Lichao Sun, and Yanfang Ye. Can llms move beyond short exchanges to realistic therapy conversations? InThe Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    C. Konnoth. AI and data protection law in health. In B. Solaiman and I. G. Cohen, editors,Research Handbook on Health, AI and the Law. Edward Elgar Publishing, Cheltenham, UK, 2024

  38. [38]

    Counsel chat: Bootstrapping high-quality therapy data

    Nicolas Bertagnolli. Counsel chat: Bootstrapping high-quality therapy data. https://huggingface. co/datasets/nbertagnolli/counsel-chat, 2020. Hugging Face dataset

  39. [39]

    Bischof, A

    G. Bischof, A. Bischof, and H.-J. Rumpf. Motivational interviewing: An evidence-based approach for use in medical practice.Deutsches Ärzteblatt International, 118:109–115, 2021

  40. [40]

    A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

    Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. A scoping review of large language models for generative tasks in mental health care.npj Digital Medicine, 8(1):230, 2025

  41. [41]

    GPT-4o system card.https://openai.com/index/gpt-4o-system-card, 2025

    OpenAI. GPT-4o system card.https://openai.com/index/gpt-4o-system-card, 2025

  42. [42]

    Claude sonnet 4 model card.https://www.anthropic.com/claude/sonnet, 2025

    Anthropic. Claude sonnet 4 model card.https://www.anthropic.com/claude/sonnet, 2025

  43. [43]

    Gemini 2.5 flash

    Google DeepMind. Gemini 2.5 flash. https://deepmind.google/technologies/gemini/flash, 2025

  44. [44]

    Llama 4: The next generation of meta’s open foundation models

    Meta AI. Llama 4: The next generation of meta’s open foundation models. https://ai.meta.com/ blog/llama-4-multimodal-intelligence, 2025

  45. [45]

    Qwen3 technical report.https://qwenlm.github.io/blog/qwen3, 2025

    Qwen Team. Qwen3 technical report.https://qwenlm.github.io/blog/qwen3, 2025

  46. [46]

    Marsha M. Linehan. Dialectical behavior therapy for treatment of borderline personality disorder: Implica- tions for the treatment of substance abuse.NIDA Research Monograph, 137:201, 1993

  47. [47]

    Pim Cuijpers, Mirjam Reijnders, and Marcus J. H. Huibers. The role of common factors in psychotherapy outcomes.Annual Review of Clinical Psychology, 15(1):207–231, 2019

  48. [48]

    Wampold and Zac E

    Bruce E. Wampold and Zac E. Imel.The Great Psychotherapy Debate: The Evidence for What Makes Psychotherapy Work. Routledge, 2015

  49. [49]

    Westra and Nahal Norouzian

    Henny A. Westra and Nahal Norouzian. Using motivational interviewing to manage process markers of ambivalence and resistance in cognitive behavioral therapy.Cognitive Therapy and Research, 42(2):193– 203, 2018. 12

  50. [50]

    The art of tentativity: Delivering interpretations in psychodynamic psychotherapy.Journal of Pragmatics, 176:76–96, 2021

    Anja Stukenbrock, Arnulf Deppermann, and Carl Eduard Scheidt. The art of tentativity: Delivering interpretations in psychodynamic psychotherapy.Journal of Pragmatics, 176:76–96, 2021

  51. [51]

    Bohart, Jeanne C

    Robert Elliott, Arthur C. Bohart, Jeanne C. Watson, and David Murphy. Therapist empathy and client outcome: An updated meta-analysis.Psychotherapy, 55(4):399, 2018

  52. [52]

    Borelli, Leah Sohn, Beverly A

    Jessica L. Borelli, Leah Sohn, Beverly A. Wang, Kathy Hong, Cassandra DeCoste, and Nancy E. Suchman. Therapist–client language matching: Initial promise as a measure of therapist–client relationship quality. Psychoanalytic Psychology, 36(1):9, 2019

  53. [53]

    Norcross and Michael J

    John C. Norcross and Michael J. Lambert. Psychotherapy relationships that work III.Psychotherapy, 55(4):303, 2018

  54. [54]

    G. O. Boateng, T. B. Neilands, E. A. Frongillo, H. R. Melgar-Quinonez, and S. L. Young. Best practices for developing and validating scales for health, social, and behavioral research: a primer.Frontiers in Public Health, 6:149, 2018

  55. [55]

    Spearman

    C. Spearman. The proof and measurement of association between two things.American Journal of Psychology, 15(1):72–101, 1904

  56. [56]

    Virtanen, R

    P. Virtanen, R. Gommers, T. E. Oliphant, et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python.Nature Methods, 17:261–272, 2020

  57. [57]

    M. G. Kendall and A. Stuart.The Advanced Theory of Statistics, Vol. 2: Inference and Relationship. Charles Griffin, 3 edition, 1973

  58. [58]

    Cohen.Statistical Power Analysis for the Behavioral Sciences

    J. Cohen.Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 2 edition, 1988

  59. [59]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33:159–174, 1977

  60. [60]

    Miller et al

    William R. Miller et al. Manual for the motivational interviewing skill code (misc), version 2. University of New Mexico, 2003

  61. [61]

    Moyers et al

    Theresa B. Moyers et al. Motivational interviewing treatment integrity coding manual 4.1. University of New Mexico, 2014

  62. [62]

    Feinstein and Domenic V

    Alvan R. Feinstein and Domenic V . Cicchetti. High agreement but low kappa.Journal of Clinical Epidemiology, 43:543–549, 1990

  63. [63]

    Sage, 2004

    Klaus Krippendorff.Content Analysis: An Introduction to Its Methodology. Sage, 2004

  64. [64]

    B. Efron. Bootstrap methods: another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

  65. [65]

    N/A” indicates that the benchmark does not include psychotherapeutic response evaluation, and “N/C

    Y . Benjamini and Y . Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995. A Ethics, Data, and Release The human-evaluation protocol was reviewed and approved by the authors’ Institutional Review Board (details withhel...