Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

Beining Xu; Hanbo Zhang; Tianze Han; Yongming Lu

Dynamic Emotional Signature Graphs detect implicit sycophancy in mental-health dialogues by scoring clinical-state transitions on a leakage-audited benchmark.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-07-01 00:21 UTC pith:IKVP6CRT

Load-bearing objection: The leakage-audited benchmark is a concrete step forward, but the 0.0488 F1 gain rests on unvalidated LLM state extraction that may carry the same bias.

arxiv 2605.03472 v2 pith:IKVP6CRT submitted 2026-05-05 cs.CL cs.AI

Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

classification cs.CL cs.AI

keywords: implicit sycophancy, mental-health dialogue, clinical-state diagnostics, cognitive distortion, harmful-risk detection, matched benchmark, state transitions, Dynamic Emotional Signature Graphs

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mental-health dialogue responses can appear empathetic while implicitly reinforcing patterns such as catastrophizing, avoidance, or hopeless prediction. The paper builds a diagnostic benchmark from peer support, counseling, and crisis sources, then creates a leakage-audited clean matched set of 500 contexts and 1,500 response windows. It proposes Dynamic Emotional Signature Graphs, which extract semantic, affective, and cognitive-distortion states via LLM and score the direction of clinical change induced by each response. On this benchmark, the DESG-StateRisk variant improves macro-F1 by 0.0488 over the strongest non-DESG baseline and leads in harmful-risk detection. The work shows that reliable detection of this hidden failure mode needs explicit clinical-state modeling plus controls for leakage and shortcuts.

Core claim

The paper establishes that DESG, by separating LLM-based state extraction from scoring and evaluating the direction of semantic, affective, and cognitive-distortion state transitions, outperforms metadata, surface-style, lexical, embedding, and rubric-LLM baselines; on the leakage-audited clean matched benchmark it improves macro-F1 by 0.0488 and achieves the best harmful-risk detection result.
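The headline number is a difference in macro-F1, which is the unweighted mean of per-class F1 scores, so a rare class such as harmful windows weighs as much as a common one. The following is a minimal sketch in plain Python. The three label names are an assumption borrowed from the Figure 5 caption (harmful, neutral, productive), not a confirmed label set, and the toy predictions are invented for illustration.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given label set."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Assumed label set and invented toy data.
LABELS = ("harmful", "neutral", "productive")
y_true = ["harmful", "harmful", "neutral", "neutral", "productive", "productive"]
y_pred = ["harmful", "neutral", "neutral", "neutral", "productive", "harmful"]
print(round(macro_f1(y_true, y_pred, LABELS), 4))
```

Because every class contributes one third of the score here, a gain of 0.0488 can come from a modest improvement on a single class, which is one reason per-class results matter when reading the claim.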

What carries the argument

Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that extracts semantic, affective, and cognitive-distortion states via LLM and scores clinical direction through state transitions rather than free-form judgment.
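To make "scoring clinical direction through state transitions" concrete, here is a deliberately simplified sketch. It is not the paper's scoring rule. The `ClinicalState` fields, the `transition_label` function, and the `risk_eps` threshold are all hypothetical, loosely inspired by the "cognitive-risk mass" and "scaled valence" curves mentioned in the Figure 6 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalState:
    valence: float    # hypothetical affect score in [-1, 1]
    risk_mass: float  # hypothetical total weight on cognitive distortions, in [0, 1]


def transition_label(before: ClinicalState, after: ClinicalState,
                     risk_eps: float = 0.05) -> str:
    """Label the direction of change induced by a response (illustrative rule only)."""
    d_risk = after.risk_mass - before.risk_mass
    if d_risk > risk_eps:
        return "harmful"      # distortion weight rose, regardless of tone
    if d_risk < -risk_eps:
        return "productive"   # distortion weight fell
    return "neutral"


# A warm-sounding reply can lift valence while also raising risk mass.
before = ClinicalState(valence=-0.6, risk_mass=0.40)
after_validating = ClinicalState(valence=-0.2, risk_mass=0.55)
after_reframing = ClinicalState(valence=-0.4, risk_mass=0.25)
print(transition_label(before, after_validating))
print(transition_label(before, after_reframing))
```

The point of the sketch is the separation of concerns: the extractor produces the two states, and a fixed rule scores the transition, so the final label does not depend on a free-form judgment of how supportive the reply sounds.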

Load-bearing premise

The LLM-based extraction of semantic, affective, and cognitive-distortion states produces reliable clinical direction signals that are not themselves biased by the sycophancy patterns being detected.

What would settle it

A test in which the state-extraction step is shown to reinforce the same harmful patterns or in which DESG-StateRisk loses its performance edge on an independently constructed clean matched benchmark.

If this is right

Evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.
Surface-style, lexical, embedding, and rubric-LLM baselines are outperformed when direction of clinical-state change is scored directly.
A clean single-response matched benchmark built from everyday, counseling-style, and crisis sources enables more reliable harmful-risk detection.
Three representative dialogue sources provide coverage across peer support, emotional support, and crisis-oriented interactions.
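A shortcut control asks whether a trivial surface feature already predicts the label. The sketch below is one hedged way to run such a probe: it compares the best single-threshold rule on response length against the majority-class baseline. The function names and the four toy responses are invented for illustration, and the paper's actual controls are not described in enough detail here to reproduce.

```python
from collections import Counter


def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def best_length_threshold_accuracy(texts, labels, positive="harmful"):
    """Best accuracy of the rule 'word count >= t means positive' (or its inverse)."""
    lengths = [len(t.split()) for t in texts]
    best = 0.0
    for t in sorted(set(lengths)) + [max(lengths) + 1]:
        hits = sum((n >= t) == (y == positive) for n, y in zip(lengths, labels))
        acc = hits / len(labels)
        best = max(best, acc, 1 - acc)  # 1 - acc covers the inverted rule
    return best


# Invented toy windows in which the harmful replies happen to be the longest.
texts = [
    "you are right, nothing will ever change",
    "that sounds hard",
    "what small step feels possible today",
    "it makes sense to avoid them forever",
]
labels = ["harmful", "neutral", "productive", "harmful"]
print(majority_accuracy(labels), best_length_threshold_accuracy(texts, labels))
```

If a length rule beats the majority baseline by a wide margin, as it does on this toy set, the benchmark leaks label information through a surface cue and any detector's score on it is hard to interpret.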

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Mental-health chatbot developers could embed similar state-transition audits into evaluation pipelines to reduce unintended reinforcement of distortions.
The state-transition approach may extend to detecting subtle reinforcing biases in other dialogue settings such as educational or advisory conversations.
Replacing the LLM extractor with domain-specific clinical models or human annotators could increase reliability while preserving the graph structure.
Widespread use would shift safety standards for therapeutic AI from empathy-focused metrics toward measurable clinical direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The leakage-audited benchmark is a concrete step forward, but the 0.0488 F1 gain rests on unvalidated LLM state extraction that may carry the same bias.

The clean matched benchmark is the part worth noting. They pulled contexts from three mental-health sources, built 500 contexts with 1,500 matched response windows, and added leakage audits plus shortcut controls. That setup is more disciplined than most surface-style or rubric baselines.
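The counts imply an average of 1,500 / 500 = 3 matched windows per context. One basic leakage audit for such a matched set is to confirm that no context contributes windows to both the training and test splits. The sketch below assumes exactly three windows per context and uses hypothetical identifiers such as `c12_r0`; neither detail is confirmed by the source.

```python
import random


def split_by_context(window_to_context, test_fraction=0.2, seed=0):
    """Assign whole contexts to train or test so matched windows stay together."""
    contexts = sorted(set(window_to_context.values()))
    rng = random.Random(seed)
    rng.shuffle(contexts)
    n_test = int(round(len(contexts) * test_fraction))
    test_contexts = set(contexts[:n_test])
    split = {"train": [], "test": []}
    for window_id, ctx in window_to_context.items():
        split["test" if ctx in test_contexts else "train"].append(window_id)
    return split


def leaked_contexts(split, window_to_context):
    """Contexts that appear on both sides of the split (should be empty)."""
    train_ctx = {window_to_context[w] for w in split["train"]}
    test_ctx = {window_to_context[w] for w in split["test"]}
    return train_ctx & test_ctx


# Toy benchmark: 500 contexts with 3 response windows each (an assumption).
window_to_context = {f"c{c}_r{r}": f"c{c}" for c in range(500) for r in range(3)}
split = split_by_context(window_to_context)
print(len(split["train"]), len(split["test"]), leaked_contexts(split, window_to_context))
```

Splitting at the window level instead would let a model see one response to a context in training and a sibling response at test time, which is exactly the kind of leakage a matched benchmark has to rule out.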

DESG separates LLM state extraction (semantic, affective, cognitive-distortion) from the final scoring of clinical direction. On the clean set it edges the strongest non-DESG baseline by 0.0488 macro-F1 and leads on harmful-risk detection. The design choice to avoid free-form LLM judgment is sensible.

The lift is small and reported without error bars or details on how the response windows were sampled or validated. The larger gap is that the paper gives no accuracy numbers or clinician agreement for the state extractor, and no test for whether the extractor itself introduces sycophantic distortions. If it does, the reported improvement could be an artifact.
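One common way to attach error bars to a metric difference is a paired percentile bootstrap: resample the same items for both systems and recompute the delta each time. The sketch below is a generic illustration under that assumption, not the authors' procedure. The label names and toy predictions are invented.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


def bootstrap_delta_ci(y_true, pred_a, pred_b, labels,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for macro-F1(A) - macro-F1(B) under paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same items for both systems
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(macro_f1(t, a, labels) - macro_f1(t, b, labels))
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


LABELS = ("harmful", "neutral", "productive")  # assumed label set
y_true = ["harmful", "neutral", "productive"] * 20
system_a = list(y_true)                # toy: perfect predictions
system_b = ["neutral"] * len(y_true)   # toy: always predicts one class
lo, hi = bootstrap_delta_ci(y_true, system_a, system_b, LABELS)
print(f"95% CI for the macro-F1 delta: [{lo:.3f}, {hi:.3f}]")
```

For a matched benchmark it may be more appropriate to resample whole contexts rather than individual windows, since windows that share a context are not independent. A delta of 0.0488 is only persuasive if an interval of this kind excludes zero.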

This is for researchers building or auditing dialogue systems for emotional support. The benchmark construction itself is the piece that could travel.

Send it to peer review. The controls and multi-source matching are explicit enough to merit referee time, even though the extraction step needs direct validation evidence.

Referee Report

2 major / 1 minor

Summary. The paper claims that implicit sycophancy (responses that appear empathetic while reinforcing harmful cognitive patterns) in mental-health dialogue can be audited via a new leakage-audited clean matched benchmark (500 contexts, 1,500 response windows from three dialogue sources) and the DESG framework, which separates LLM-based extraction of semantic/affective/cognitive-distortion states from scoring of clinical-state transitions; on this benchmark DESG-StateRisk yields a 0.0488 macro-F1 gain over the strongest non-DESG baseline and the best harmful-risk result.

Significance. If the extraction step is shown to be reliable, the work supplies a structured, direction-aware alternative to surface-style or free-form LLM evaluators and demonstrates the value of explicit leakage controls and matched benchmarks; the emphasis on clinical-state transitions rather than metadata or lexical cues is a constructive direction for safety auditing.

major comments (2)

[Abstract / DESG framework description] The reported 0.0488 macro-F1 improvement and best harmful-risk result rest on LLM extraction of semantic, affective, and cognitive-distortion states, yet no extraction accuracy, clinician inter-annotator agreement, or bias audit of the extractor itself is provided; this is load-bearing because any sycophancy bias in the extractor LLM would systematically distort the state-transition signals that DESG scores.
[Benchmark construction] Benchmark section (implied by abstract): the construction of the 1,500 matched response windows and the post-hoc cleaning procedure are described only at a high level, with no error bars, statistical significance tests, or sensitivity analysis for the 0.0488 macro-F1 delta; without these the modest lift cannot be distinguished from sampling or cleaning artifacts.
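A significance test of the kind the second comment asks for could be a paired approximate-randomization test: randomly swap the two systems' predictions item by item and count how often the shuffled delta is at least as large as the observed one. The sketch below is one generic option, not a procedure taken from the paper. The label names and toy data are invented.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


def paired_randomization_p(y_true, pred_a, pred_b, labels, n_perm=1000, seed=0):
    """Two-sided p-value for the macro-F1 difference between two systems."""
    rng = random.Random(seed)
    observed = abs(macro_f1(y_true, pred_a, labels) - macro_f1(y_true, pred_b, labels))
    at_least = 0
    for _ in range(n_perm):
        a, b = [], []
        for pa, pb in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # swap this item's two predictions
                pa, pb = pb, pa
            a.append(pa)
            b.append(pb)
        delta = abs(macro_f1(y_true, a, labels) - macro_f1(y_true, b, labels))
        if delta >= observed - 1e-12:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)  # add-one smoothing keeps p above zero


LABELS = ("harmful", "neutral", "productive")  # assumed label set
y_true = ["harmful", "neutral", "productive"] * 20
system_a = list(y_true)                # toy: perfect predictions
system_b = ["neutral"] * len(y_true)   # toy: always predicts one class
print(paired_randomization_p(y_true, system_a, system_b, LABELS))
```

The test makes no distributional assumption about macro-F1, which is convenient because the metric is not a simple average over items.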

minor comments (1)

[Abstract] The three source datasets are referred to only generically; explicit names and citations should be supplied when first introduced.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important aspects of validation and statistical rigor. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract / DESG framework description] The reported 0.0488 macro-F1 improvement and best harmful-risk result rest on LLM extraction of semantic, affective, and cognitive-distortion states, yet no extraction accuracy, clinician inter-annotator agreement, or bias audit of the extractor itself is provided; this is load-bearing because any sycophancy bias in the extractor LLM would systematically distort the state-transition signals that DESG scores.

Authors: We agree that the reliability of the LLM-based state extraction step is foundational and that its absence represents a gap. The current manuscript emphasizes the overall DESG framework and benchmark results but does not report extractor-level validation metrics. We will revise the paper to add a dedicated subsection detailing: (i) accuracy of semantic, affective, and cognitive-distortion state extraction against clinician-annotated gold labels on a held-out subset; (ii) inter-annotator agreement (e.g., Cohen's kappa) among multiple clinicians; and (iii) a targeted bias audit for sycophantic tendencies in the extractor outputs. These additions will be placed in the Methods section and will include the annotation protocol and sample size. revision: yes
Referee: [Benchmark construction] Benchmark section (implied by abstract): the construction of the 1,500 matched response windows and the post-hoc cleaning procedure are described only at a high level, with no error bars, statistical significance tests, or sensitivity analysis for the 0.0488 macro-F1 delta; without these the modest lift cannot be distinguished from sampling or cleaning artifacts.

Authors: We acknowledge that the benchmark construction and the statistical characterization of the performance delta are presented at a summary level. We will expand the relevant section to include: a more granular description of the matching procedure across the three dialogue sources and the post-hoc cleaning steps (including explicit criteria and any automated filters); bootstrapped or cross-run error bars around the macro-F1 scores; results of statistical significance tests comparing DESG-StateRisk to the strongest baseline; and sensitivity analyses that vary key parameters such as context window size, cleaning thresholds, and source proportions. These changes will clarify that the reported 0.0488 improvement is not an artifact of sampling or cleaning choices. revision: yes
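The rebuttal names Cohen's kappa as the agreement statistic for clinician annotations. As a reference point, kappa is observed agreement corrected for the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two raters in plain Python. The distortion labels come from the abstract, while the annotations themselves are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented annotations from two hypothetical clinicians.
clinician_1 = ["catastrophizing", "none", "avoidance", "none", "catastrophizing", "none"]
clinician_2 = ["catastrophizing", "none", "none", "none", "catastrophizing", "avoidance"]
print(round(cohens_kappa(clinician_1, clinician_2), 4))
```

Raw agreement on this toy set is 4 of 6, but kappa is noticeably lower because "none" is frequent for both raters and inflates chance agreement. The same statistic could be applied between the LLM extractor and a clinician to quantify extractor reliability.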

Circularity Check

0 steps flagged

No circularity: empirical results on constructed benchmark are independent of fitted parameters or self-citations

The paper presents an empirical study introducing a new diagnostic benchmark and the DESG framework, which separates LLM-based state extraction from downstream scoring of semantic/affective/cognitive-distortion transitions. Reported gains (0.0488 macro-F1) are measured performance on the leakage-audited matched benchmark against baselines; no equations, fitted parameters, or self-citation chains are shown that would make the central claims equivalent to their inputs by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that LLM-extracted clinical states are sufficiently unbiased to serve as ground truth for direction scoring; the paper introduces the DESG framework as a new structured method without citing prior formalization of its state-transition rules.

axioms (1)

domain assumption: LLM-based extraction of semantic, affective, and cognitive-distortion states produces consistent clinical direction signals across response windows
Invoked in the description of DESG separating state extraction from final scoring

invented entities (1)

Dynamic Emotional Signature Graphs (DESG): no independent evidence
purpose: Structured offline audit that models clinical-state transitions rather than free-form judgment
New framework proposed in the paper; no independent evidence outside this work is provided in the abstract

pith-pipeline@v0.9.1-grok · 5796 in / 1465 out tokens · 30409 ms · 2026-07-01T00:21:29.679426+00:00 · methodology

Original abstract

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

Figures

Figures reproduced from arXiv: 2605.03472 by Beining Xu, Hanbo Zhang, Tianze Han, Yongming Lu.

**Figure 1.** Evaluation blind spot for stealth sycophancy, where clinically harmful directionality can appear as supportive surface language.

**Figure 2.** DESG pipeline and validity controls, separating state extraction, clinical-state representation, directed graph scoring, and benchmark auditing.

**Figure 3.** Representative harmful windows missed by the direct LLM judge and official evaluator baselines.

**Figure 4.** Exploratory t-SNE views of pure-text and affective-manifold representations.

**Figure 5.** Harmful-window miss patterns for direct and external evaluator baselines. The upper-left inset summarizes each evaluator's aggregate miss or parse-failure rate over all harmful test windows. Rows in the matrix are representative harmful cases, columns are evaluators, green cells mark harmful predictions, orange cells mark neutral or productive misses, and gray cells mark parse failures.

**Figure 6.** Representative state trajectories behind the qualitative disagreement cases. Red curves show cognitive-risk mass and blue curves show scaled valence, allowing the analysis to distinguish surface support from sustained clinical risk.

**Figure 7.** Parameter-sensitivity ranges used as a mechanism-claim gate. Each horizontal segment spans the tested range within a parameter family, with the default and best settings marked separately.

**Figure 8.** Mechanism sanity-control deltas relative to the default setting. Negative bars indicate performance degradation under a perturbation, whereas near-zero or positive bars weaken necessity claims for that component.

**Figure 9.** Deep-branch and ensemble robustness diagnostics. The left panel summarizes seed-level performance and mean lines, while the right panel shows the late-fusion alpha sweep.

Review history (3 revisions) →

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Feeling Better: Capability-Sustaining Emotional Dialogue as a Longitudinal Research Paradigm
cs.CL 2026-07 conditional novelty 6.0

Proposes a 'capability-sustaining' longitudinal paradigm for emotional dialogue, backed by an audit showing current systems focus on relief and never measure long-term capability.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Guilford Press (1979)

Beck, A.T., Rush, A.J., Shaw, B.F., Emery, G.: Cognitive Therapy of Depression. Guilford Press (1979)

work page 1979
[2]

Behaviour Research and Therapy 70, 32–37 (2015)

Braun, J.D., Strunk, D.R., Sasso, K.E., Cooper, A.A.: Therapist use of Socratic questioning predicts session-to-session symptom change in cognitive therapy for depression. Behaviour Research and Therapy 70, 32–37 (2015). https://doi.org/10.1016/j.brat.2015.05.004

work page doi:10.1016/j.brat.2015.05.004 2015
[3]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Chen, G.H., Chen, S., Liu, Z., Jiang, F., Wang, B.: Humans or LLMs as the judge? a study on judgement bias. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 8301–8327. Association for Computational Linguistics (2024). https://doi.org/10.18653/v1/2024.emnlp-main.474

work page doi:10.18653/v1/2024.emnlp-main.474 2024
[4]

In: Findings of the Association for Computational Linguistics: ACL 2024

Chen, Y., Yan, S., Liu, S., Li, Y., Xiao, Y.: EmotionQueen: A benchmark for evaluating empathy of large language models. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 2149–2176. Association for Computational Linguistics (2024). https://doi.org/10.18653/v1/2024.findings-acl.128

work page doi:10.18653/v1/2024.findings-acl.128 2024
[5]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chiang, C.H., Lee, H.y., Lukasik, M.: TRACT: Regression-aware fine-tuning meets chain-of-thought reasoning for LLM-as-a-judge. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2934–2952. Association for Computational Linguistics (2025). https://doi.org/10.18653/v1/2025.acl-long.147

work page doi:10.18653/v1/2025.acl-long.147 2025
[6]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

D’Souza, J., Babaei Giglou, H., Münch, Q.: YESciEval: Robust LLM-as-a-judge for scientific question answering. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 13749–13783. Association for Computational Linguistics (2025). https://doi.org/10.18653/v1/2025.acl-long.675

work page doi:10.18653/v1/2025.acl-long.675 2025
[7]

Psychotherapy 48(1), 43–49 (2011)

Elliott, R., Bohart, A.C., Watson, J.C., Greenberg, L.S.: Empathy. Psychotherapy 48(1), 43–49 (2011). https://doi.org/10.1037/a0022187

work page doi:10.1037/a0022187 2011
[8]

Suicide and Life-Threatening Behavior 37(3), 338–352 (2007)

Gould, M.S., Kalafat, J., Harris Munfakh, J.L., Kleinman, M.: An evaluation of crisis hotline outcomes part 2: Suicidal callers. Suicide and Life-Threatening Behavior 37(3), 338–352 (2007). https://doi.org/10.1521/suli.2007.37.3.338

work page doi:10.1521/suli.2007.37.3.338 2007
[9]

A survey on LLM-as-a-judge

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., Guo, J.: A survey on LLM-as-a-judge. The Innovation 7(6), 101253 (2026). https://doi.org/10.1016/j.xinn.2025.101253

work page doi:10.1016/j.xinn.2025.101253 2026
[10]

Cognitive Therapy and Research 36(5), 427–440 (2012)

Hofmann, S.G., Asnaani, A., Vonk, I.J.J., Sawyer, A.T., Fang, A.: The efficacy of cognitive behavioral therapy: A review of meta-analyses. Cognitive Therapy and Research 36(5), 427–440 (2012). https://doi.org/10.1007/s10608-012-9476-1

work page doi:10.1007/s10608-012-9476-1 2012
[11]

JMIR mHealth and uHealth 6(11), e12106 (2018)

Inkster, B., Sarda, S., Subramanian, V.: An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth and uHealth 6(11), e12106 (2018). https://doi.org/10.2196/12106

work page doi:10.2196/12106 2018
[13]

Lee, D., Hwang, Y., Kim, Y., Park, J., Jung, K.: Are LLM-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on LLM-based evaluation. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)....

work page doi:10.18653/v1/2025.naacl-long.452 2025
[15]

In: Findings of the Association for Computational Linguistics: EMNLP 2024

Li, A., Lu, Y., Song, N., Zhang, S., Ma, L., Lan, Z.: Understanding the therapeutic relationship between counselors and clients in online text-based counseling using LLMs. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1280–1303. Association for Computational Linguistics (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.69

work page doi:10.18653/v1/2024.findings-emnlp.69 2024
[18]

In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=gtkFw6sZGS

Li, J., Sun, S., Yuan, W., Fan, R.Z., Zhao, H., Liu, P.: Generative judge for evaluating alignment. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=gtkFw6sZGS

work page 2024
[19]

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Li, Y., Yao, J., Bunyi, J.B.S., Frank, A.C., Hwang, A.H.C., Liu, R.: CounselBench: A large-scale expert evaluation and adversarial benchmarking of large language models in mental health question answering. arXiv preprint arXiv:2506.08584 (2025), https://arxiv.org/abs/2506.08584

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Liu, S., Zheng, C., Demasi, O., Sabour, S., Li, Y., Yu, Z., Jiang, Y., Huang, M.: Towards emotional support dialog systems. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 3469–3483. Association for Comput...

work page doi:10.18653/v1/2021.acl-long.269 2021
[22]

In: Findings of the Association for Computational Linguistics: ACL 2025

Na, H., Hua, Y., Wang, Z., Shen, T., Yu, B., Wang, L., Wang, W., Torous, J., Chen, L.: A survey of large language models in psychotherapy: Current landscape and future directions. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 7362–7376. Association for Computational Linguistics (2025). https://doi.org/10.18653/v1/2025.findi...

work page doi:10.18653/v1/2025.findings-acl.385 2025
[24]

LLM Evaluators Recognize and Favor Their Own Generations

Panickssery, A., Bowman, S.R., Feng, S.: LLM evaluators recognize and favor their own generations. In: Advances in Neural Information Processing Systems (2024), https://arxiv.org/abs/2404.13076

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

In: Findings of the Association for Computational Linguistics: EMNLP 2024

Park, J., Jwa, S., Meiying, R., Kim, D., Choi, S.: OffsetBias: Leveraging debiased data for tuning evaluators. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 1043–1067. Association for Computational Linguistics (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.57

work page doi:10.18653/v1/2024.findings-emnlp.57 2024
[27]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992. Association for Computational Linguistics (2019). https://doi.org/10.1...

work page doi:10.18653/v1/d19-1410 2019
[28]

Europe’s Journal of Psychology 12(3), 348–362 (2016)

Rnic, K., Dozois, D.J.A., Martin, R.A.: Cognitive distortions, humor styles, and depression. Europe’s Journal of Psychology 12(3), 348–362 (2016). https://doi.org/10.5964/ejop.v12i3.1118

work page doi:10.5964/ejop.v12i3.1118 2016
[29]

Journal of Consulting Psychology 21(2), 95–103 (1957)

Rogers, C.R.: The necessary and sufficient conditions of therapeutic personality change. Journal of Consulting Psychology 21(2), 95–103 (1957). https://doi.org/10.1037/h0045357

work page doi:10.1037/h0045357 1957
[30]

Child Development 73(6), 1830–1843 (2002)

Rose, A.J.: Co-rumination in the friendships of girls and boys. Child Development 73(6), 1830–1843 (2002). https://doi.org/10.1111/1467-8624.00509

work page doi:10.1111/1467-8624.00509 2002
[31]

Journal of Personality and Social Psychology 39(6), 1161–1178 (1980)

Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology 39(6), 1161–1178 (1980). https://doi.org/10.1037/h0077714

work page doi:10.1037/h0077714 1980
[32]

In: Chinese Conference on Pattern Recognition and Computer Vision

Shan, G., Ma, X., Bai, X., Zhu, H., Wang, T., Zhu, S., Wang, L.: Dental diagnosis from x-ray panoramic radiography images: A dataset and a hybrid framework. In: Chinese Conference on Pattern Recognition and Computer Vision. pp. 234–248 (2024). https://doi.org/10.1007/978-981-97-8496-7_17

work page doi:10.1007/978-981-97-8496-7_17 2024
[33]

Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., Vosoughi, S.: Judging the judges: A systematic study of position bias in LLM-as-a-judge. In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 292–314. The Asian Federa...

work page 2025
[34]

Hugging Face dataset (2026), https://huggingface.co/datasets/SungJoo/Cradle-Dialogue, dataset card

SungJoo: CRADLE-Dialogue: Crisis-response dialogue dataset. Hugging Face dataset (2026), https://huggingface.co/datasets/SungJoo/Cradle-Dialogue, dataset card

work page 2026
[35]

The Canadian Journal of Psychiatry 64(7), 456–464 (2019)

Vaidyam, A.N., Wisniewski, H., Halamka, J.D., Kashavan, M.S., Torous, J.B.: Chatbots and conversational agents in mental health: A review of the psychiatric landscape. The Canadian Journal of Psychiatry 64(7), 456–464 (2019). https://doi.org/10.1177/0706743719828977

work page doi:10.1177/0706743719828977 2019
[37]

Self-Preference Bias in LLM-as-a-Judge

Wataoka, K., Takahashi, T., Ri, R.: Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819 (2024), https://arxiv.org/abs/2410.21819

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Watts, I., Gumma, V., Yadavalli, A., Seshadri, V., Swaminathan, M., Sitaram, S.: PARIKSHA: A large-scale investigation of human-LLM evaluator agreement on multilingual and multi-cultural data. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 7900–7932. Association for Computational Linguistics (2024). https://doi.org/...

work page doi:10.18653/v1/2024.emnlp-main 2024
[39]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Xie, H., Chen, Y., Xing, X., Lin, J., Xu, X.: PsyDT: Using LLMs to construct the digital twin of psychological counselor with personalized counseling style for psychological counseling. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1081–1115. Association for Computational Linguis...

work page doi:10.18653/v1/ 2025
[41]

In: Findings of the Association for Computational Linguistics: ACL 2024

Zhang, C., Li, R., Tan, M., Yang, M., Zhu, J., Yang, D., Zhao, J., Ye, G., Li, C., Hu, X.: CPsyCoun: A report-based multi-turn dialogue reconstruction and evaluation framework for Chinese psychological counseling. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 13947–13966. Association for Computational Linguistics (2024). h...

work page doi:10.18653/v1/2024.findings-acl.830 2024
[42]

Zhang, M., Yang, X., Zhang, X., Labrum, T., Chiu, J.C., Eack, S.M., Fang, F., Wang, W.Y., Chen, Z.: CBT-bench: Evaluating large language models on assisting cognitive behavior therapy. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Langu...

work page doi:10.18653/v1/2025 2025
[43]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Zhang, Q., Wang, Y., Jiang, Y., Li, L., Wu, C., Wang, Y., Jiang, X., Shang, L., Tang, R., Lyu, F., Ma, C.: Crowd comparative reasoning: Unlocking comprehensive evaluations for LLM-as-a-judge. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5059–

work page
[44]

https://doi.org/10.18653/ v1/2025.acl-long.252

Association for Computational Linguistics (2025). https://doi.org/10.18653/ v1/2025.acl-long.252

work page 2025
[45]

In: International Conference on Learning Rep- resentations (2020), https://openreview.net/forum?id=SkeHuCVFDr

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evalu- ating text generation with BERT. In: International Conference on Learning Rep- resentations (2020), https://openreview.net/forum?id=SkeHuCVFDr

work page 2020
[46]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing

Zhao, H., Li, L., Chen, S., Kong, S., Wang, J., Huang, K., Gu, T., Wang, Y., Wang, J., Dandan, L., Li, Z., Teng, Y., Xiao, Y., Wang, Y.: ESC-eval: Evalu- ating emotion support conversations in large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing. pp. 15785–15810. Association for Computational ...

work page doi:10.18653/v1/2024.emnlp-main.883 2024
