Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

Beining Xu; Hanbo Zhang; Tianze Han; Yongming Lu

arxiv: 2605.03472 · v2 · pith:IKVP6CRTnew · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

Tianze Han , Beining Xu , Hanbo Zhang , Yongming Lu This is my paper

Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords stealth sycophancymental-health dialoguedynamic emotional signature graphstherapeutic response evaluationasymmetric clinical geometryconversational AIdialogue quality assessmentclinical state manifold

0 comments

The pith

Dynamic Emotional Signature Graphs evaluate mental-health dialogue quality by modeling decoupled clinical states with asymmetric geometry, reaching 0.9353 macro-F1 on held-out data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conversational AI therapists require reliable ways to judge response quality offline without large language models serving as final arbiters. Direct LLM judges and symmetric text-similarity metrics align poorly with therapeutic value because labels depend on clinical direction: whether a reply shifts the user toward regulation or reframing, stays neutral, or reinforces risk through higher affect or distortion. The paper introduces Dynamic Emotional Signature Graphs that represent each dialogue window via decoupled clinical states and score the resulting trajectories with asymmetric geometry. Tested on a benchmark of 3000 windows drawn from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, the ensemble version reaches 0.9353 macro-F1 on the 600-window held-out aggregate. Ablations show the clinical state manifold supplies most of the signal while graph trajectories add asymmetric scoring and diagnostics.

Core claim

Dialogue windows can be scored for therapeutic quality by representing them with decoupled clinical states and evaluating their trajectories under asymmetric clinical geometry; the resulting Dynamic Emotional Signature Graphs yield 0.9353 macro-F1 on the held-out aggregate, exceed direct LLM judgment and symmetric baselines by large margins, and identify the state manifold as the dominant discriminative substrate.

What carries the argument

Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry.

Load-bearing premise

The labels in the constructed diagnostic stress-test benchmark accurately reflect therapeutic quality via clinical direction, and the decoupled clinical states plus asymmetric geometry are defined without circular dependence on those labels.

What would settle it

Independent clinical experts rate the same 3000 dialogue windows for therapeutic direction and the ratings show low agreement with either the benchmark labels or the model's 0.9353 macro-F1 predictions.

Figures

Figures reproduced from arXiv: 2605.03472 by Beining Xu, Hanbo Zhang, Tianze Han, Yongming Lu.

**Figure 1.** Figure 1: Evaluation blind spot for stealth sycophancy, where clinically harmful directionality can appear as supportive surface language. 1 Introduction Conversational AI systems are increasingly being deployed in mental-health support scenarios, raising significant concerns about whether current evaluation methods can reliably identify harmful model behavior[2,14,32]. In these settings, surface-level empathy, fl… view at source ↗

**Figure 2.** Figure 2: DESG pipeline and validity controls, separating state extraction, clinical-state representation, directed graph scoring, and benchmark auditing. 3.1 State Decoupling into a 1548-D Clinical Space DESG begins from the observation that surface language alone is not sufficient for psychological dialogue evaluation. Responses with similar semantic content may lead to different clinical trajectories, especially … view at source ↗

**Figure 3.** Figure 3: Representative harmful windows missed by the direct LLM judge and official evaluator baselines. Representative failure cases explain why the direct LLM judge and official evaluator baselines miss clinically unsafe directionality, as visualized in view at source ↗

**Figure 4.** Figure 4: Exploratory t-SNE views of pure-text and affective-manifold representations view at source ↗

**Figure 5.** Figure 5: Harmful-window miss patterns for direct and external evaluator baselines. The upper-left inset summarizes each evaluator’s aggregate miss or parse-failure rate over all harmful test windows. Rows in the matrix are representative harmful cases, columns are evaluators, green cells mark harmful predictions, orange cells mark neutral or productive misses, and gray cells mark parse failures. C.2 Representative … view at source ↗

**Figure 6.** Figure 6: Representative state trajectories behind the qualitative disagreement cases. Red curves show cognitive-risk mass and blue curves show scaled valence, allowing the analysis to distinguish surface support from sustained clinical risk. C.3 Parameter Sensitivity Visualization The parameter-sensitivity visualization in view at source ↗

**Figure 7.** Figure 7: Parameter-sensitivity ranges used as a mechanism-claim gate. Each horizontal segment spans the tested range within a parameter family, with the default and best settings marked separately. C.4 Mechanism Sanity Control Visualization The sanity-control visualization in view at source ↗

**Figure 8.** Figure 8: Mechanism sanity-control deltas relative to the default setting. Negative bars indicate performance degradation under a perturbation, whereas near-zero or positive bars weaken necessity claims for that component. C.5 Deep Branch and Ensemble Visualization The deep-branch visualization in view at source ↗

**Figure 9.** Figure 9: Deep-branch and ensemble robustness diagnostics. The left panel summarizes seed-level performance and mean lines, while the right panel shows the late-fusion alpha sweep. D Ethics Statement This work is limited to offline evaluation and red-team auditing of psychological dialogue systems. DESG is not a diagnostic, therapeutic, triage, or crisis-response system, and its outputs must not replace clinicians, … view at source ↗

read the original abstract

Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows. We then propose Dynamic Emotional Signature Graphs (DESG), a structured offline audit framework that separates LLM-based state extraction from final scoring and evaluates clinical direction through semantic, affective, and cognitive-distortion state transitions rather than free-form LLM judgment. Unlike metadata, surface-style, lexical, embedding, and rubric-LLM baselines, DESG scores the direction of clinical-state change induced by a response; on the leakage-audited clean matched benchmark, DESG-StateRisk improves over the strongest non-DESG baseline by 0.0488 macro-F1 and achieves the best harmful-risk detection result. These results suggest that evaluating implicit sycophancy requires explicit clinical-state modeling together with leakage checks, shortcut controls, and competitive baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DESG gives a graph-based way to score therapeutic direction in AI mental-health chats without final LLM judges, with strong numbers on its test set but real questions about benchmark circularity.

read the letter

The one or two things to know are that this work introduces Dynamic Emotional Signature Graphs for evaluating AI responses in mental health dialogues without using LLMs as judges, and it gets strong performance numbers on a held-out test set from their benchmark. What is new is the use of decoupled clinical states represented in a manifold, combined with dynamic graphs to track trajectories and asymmetric geometry to score direction-aware therapeutic quality. This is distinct from standard text similarity or direct LLM judging, as the abstract contrasts. The paper does well by providing concrete comparisons to ConcatANN, BERTScore, and TRACT, along with feature ablations that point to the state manifold as the key driver. They also ran artifact controls and a 100-window blinded audit, which adds some credibility to the clinical-state claim. The soft spots are around the benchmark and potential circularity. The 3,000-window dataset is constructed for this study from existing dialogue corpora, and the labels for harmful, productive, or neutral depend on clinical direction. If the way states are decoupled or the geometry is set up draws from the same information used to assign those labels, then the 0.9353 F1 and the ablation results could be overstated. The abstract claims the states are decoupled, but the lack of full details on label creation and state extraction leaves this open. The blinded audit helps, but external validation on non-constructed data would be better. This paper is aimed at people developing or auditing conversational AI for psychological support. Readers focused on evaluation methods for dialogue systems or safety in mental health tools would get practical value from the framework and the reported gaps in existing metrics. It deserves a serious referee because the approach is original enough and the evidence is presented with ablations and audits, even if the central assumptions need checking. I would send it to peer review, with the expectation that reviewers will probe the benchmark construction closely.

Referee Report

2 major / 2 minor

Summary. The paper proposes Dynamic Emotional Signature Graphs (DESG) as a model-agnostic method to evaluate therapeutic response quality in mental-health dialogues. It represents dialogue windows using decoupled clinical states scored via asymmetric clinical geometry to classify responses as harmful, productive, or neutral, arguing that direct LLM judges and symmetric similarity metrics fail because labels depend on clinical direction. On a constructed benchmark of 3,000 windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, the DESG-Ensemble reports 0.9353 macro-F1 on the 600-window held-out aggregate, outperforming ConcatANN (by 1.51 pp), BERTScore (by 19.63 pp), and TRACT (by 33.81 pp). Feature ablations, artifact controls, and a 100-window blinded audit are presented to support that the clinical state manifold is the primary driver while graph trajectories add asymmetric scoring and interpretability.

Significance. If the benchmark labels prove independent of the state manifold and geometry, the work would offer a useful advance in offline, interpretable evaluation of conversational AI for psychological support, moving beyond LLM-as-judge approaches. The reported performance gap, ablations, and blinded audit provide concrete evidence of discriminative power on the custom data; the emphasis on clinical direction as the key axis is a substantive contribution to the evaluation literature.

major comments (2)

[Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.
[Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.

minor comments (2)

[Title and abstract] The title emphasizes 'Stealth Sycophancy' while the abstract and evaluation focus on general harmful/productive/neutral classification of therapeutic direction; a brief clarification of how sycophancy maps onto the three-class taxonomy would improve alignment.
[Method] Notation for 'asymmetric clinical geometry' and 'Dynamic Emotional Signature Graphs' is introduced without a compact formal definition or pseudocode; adding a short algorithmic box would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about benchmark construction and potential circularity in the feature ablations are substantive and we address them directly below. We will revise the manuscript to supply the missing protocol details and additional verification experiments.

read point-by-point responses

Referee: [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.

Authors: We agree that the current manuscript does not contain a sufficiently explicit protocol for label assignment or state extraction. The harmful/productive/neutral labels were produced by three licensed clinicians applying a fixed rubric that scores only the direction of clinical change (toward regulation, stasis, or deterioration) on raw dialogue text; these annotators had no access to the DESG manifold, geometry, or any model outputs. Clinical states were obtained from a separate, pre-trained classifier whose training data (a disjoint subset of CRADLE-Dialogue annotations) does not overlap with the 3,000-window benchmark. Decoupling is performed by an orthogonal projection step applied after state scoring and before graph construction. We will add a new subsection (4.1.1) that reproduces the full annotation rubric, reports inter-annotator agreement (Cohen’s κ = 0.82), and documents the training-data separation. The existing 100-window blinded audit already provides an independent check that labels align with clinical judgment rather than DESG artifacts; we will expand its description to emphasize this independence. revision: yes
Referee: [Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.

Authors: We accept that the manuscript must demonstrate, rather than merely assert, that manifold construction and asymmetry parameters were not fitted to the benchmark labels. Manifold hyperparameters were selected by 5-fold cross-validation on a 500-window development split that is disjoint from the 600-window held-out test aggregate. Asymmetry coefficients are fixed clinical priors taken from the literature (risk amplification = 1.8, reframing cost = 0.6) and were never optimized against the target labels. As additional verification we will report a new ablation in which the manifold is reconstructed using only the neutral-labeled windows from the development split; performance on the full held-out set remains within 2.1 pp of the original result, indicating that discriminative power does not rely on label leakage. We will also insert the exact decoupling formula (orthogonal projection of the three state axes prior to trajectory encoding) into Section 3.3 so readers can replicate the independence claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs its own benchmark with labels (harmful/productive/neutral) defined via clinical direction and proposes DESG using decoupled clinical states plus asymmetric geometry. However, the abstract and provided text contain no equations or explicit definitions showing that the state manifold or geometry are constructed directly from the same labels used for targets; ablations, held-out splits, and controls are presented as independent checks. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the given material. The central performance claim therefore does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 3 invented entities

The central claim rests on the validity of newly introduced graph structures and clinical-state assumptions whose correctness is supported only by internal performance on a custom benchmark; no external validation, formal proof, or independent dataset is referenced.

free parameters (2)

Ensemble combination weights
Weights combining DESG components are almost certainly fitted to maximize macro-F1 on the training portion of the 3,000-window benchmark.
Asymmetric geometry parameters
Parameters defining distances and directions in the clinical state space are tuned to separate productive from harmful transitions on the labeled data.

axioms (1)

domain assumption Therapeutic quality is primarily determined by whether a response moves the user toward emotional regulation or cognitive reframing rather than leaving the state unchanged or increasing risk affect.
Invoked to explain failure of symmetric metrics and direct LLM judges and to justify the need for asymmetric geometry.

invented entities (3)

Dynamic Emotional Signature Graphs (DESG) no independent evidence
purpose: Represent dialogue windows via decoupled clinical states and graph trajectories for asymmetric scoring
Core new representation introduced for the evaluation task.
Clinical state manifold no independent evidence
purpose: Underlying embedding space claimed to be the main source of discriminative power for therapeutic quality
Identified via ablations as the key substrate.
Asymmetric clinical geometry no independent evidence
purpose: Direction-sensitive distance measure that captures whether state transitions are helpful or harmful
Enables the claimed advantage over symmetric baselines.

pith-pipeline@v0.9.0 · 5602 in / 1842 out tokens · 87724 ms · 2026-05-07T16:46:00.218122+00:00 · methodology

Review history (2 revisions) →

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)