When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions
Pith reviewed 2026-06-29 13:26 UTC · model grok-4.3
The pith
Emotional support dialogue systems show large performance drops when seekers resist help or disengage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that worst-case seeker interactions—marked by low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations—cause substantial performance drops across nearly all of 17 evaluated systems. Large general-purpose LLMs prove more robust than specialized ESDSes, yet even the strongest models still struggle to sustain engagement and improve seekers’ emotional states. An LLM-based simulator derived from expert counseling sessions, together with the four metrics Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support, exposes these gaps, and the same simulator can generate trai
What carries the argument
LLM-based worst-case seeker simulator built from expert counseling sessions, together with the four metrics Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support.
If this is right
- Nearly all of the 17 systems experience substantial performance drops under worst-case conditions.
- Large general-purpose LLMs remain more robust than specialized emotional-support systems.
- Even the strongest models have difficulty sustaining engagement or improving seekers’ emotional states.
- Data generated by the worst-case simulator can be used to train smaller models and increase their robustness.
Where Pith is reading between the lines
- Dialogue systems that succeed on average-case tests may still fail in actual deployment when users exhibit resistance or withdrawal.
- Routine worst-case testing could become standard for any conversational system used in emotionally sensitive settings.
- The same simulation approach might be applied to other dialogue domains that must handle uncooperative or inconsistent users.
Load-bearing premise
The seeker behaviors produced by eight counseling professionals and then reproduced by the LLM simulator accurately reflect the difficult interactions that occur with real people seeking emotional support.
What would settle it
Measure whether real users who display resistance or low disclosure during live conversations with the systems show the same engagement and emotional-state changes that the simulator predicts.
Figures
read the original abstract
Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers' emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that evaluations of Emotional Support Dialogue Systems (ESDSes) using LLM-simulated seekers are overly optimistic due to assuming cooperative, average-case user behavior. The authors first run an expert simulation study with eight counselling professionals who role-play difficult seekers (low engagement, resistance, volatility) interacting with existing Chinese ESDSes, provide ratings, and give interviews. From this they derive worst-case seeker behaviors, build an LLM-based worst-case seeker simulator, and define four new metrics (Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, Authentic and Grounded Support). They then evaluate 17 systems, reporting substantial performance drops under worst-case conditions, greater robustness for large general-purpose LLMs than specialized ESDSes, persistent struggles with engagement and emotional improvement even in the best models, and that the simulator can generate useful training data to improve smaller models.
Significance. If the expert-derived behaviors and LLM simulator are shown to be representative, the work would be significant for conversational AI and affective computing. It supplies a concrete framework and metrics for stress-testing ESDSes beyond cooperative simulations, documents consistent robustness gaps across 17 systems, and demonstrates a practical route to generating training data that improves model performance on hard cases. These contributions directly address a methodological gap that currently risks overestimating real-world readiness of emotional support systems.
major comments (3)
- [Expert simulation study] Expert simulation study: the manuscript provides no details on inter-rater reliability, statistical analysis of the eight professionals' ratings, or any external validation that the elicited worst-case behaviors match real user interactions; because the simulator and all subsequent metrics are derived directly from this study, the absence of these checks is load-bearing for the performance-drop claims.
- [Worst-case evaluation framework] Worst-case evaluation framework: the description of the four metrics (Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, Authentic and Grounded Support) gives no formulas, prompting templates, aggregation rules, or human validation of the automated scores; without this information the reported drops cannot be reproduced or assessed for reliability.
- [Evaluation of 17 systems] Evaluation results: the abstract states that 'nearly all models suffer substantial performance drops' and that 'large general-purpose LLMs are generally more robust,' yet no statistical tests, effect sizes, confidence intervals, or per-metric breakdowns are referenced; this makes it impossible to judge whether the differences are robust or merely descriptive.
minor comments (2)
- The abstract refers to interactions with 'existing Chinese ESDSes' but the later evaluation of 17 systems does not state whether the test set is restricted to Chinese models or includes multilingual ones; clarify the language scope.
- Specify the exact dialogue length or turn limit used in both the expert study and the LLM simulator to ensure reproducibility of the worst-case interactions.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which identify key areas where additional methodological detail and statistical rigor will strengthen the paper. We address each major comment below and commit to revisions that improve reproducibility and transparency without altering the core claims.
read point-by-point responses
-
Referee: [Expert simulation study] Expert simulation study: the manuscript provides no details on inter-rater reliability, statistical analysis of the eight professionals' ratings, or any external validation that the elicited worst-case behaviors match real user interactions; because the simulator and all subsequent metrics are derived directly from this study, the absence of these checks is load-bearing for the performance-drop claims.
Authors: We agree that the expert simulation study section requires expanded reporting. The study collected both quantitative scale ratings and qualitative interview data from the eight professionals; we will add a dedicated subsection reporting inter-rater reliability (Fleiss' kappa across the rating scales) and descriptive statistics on the ratings. We will also clarify the qualitative synthesis process used to derive the worst-case behaviors. External validation against real-user logs is not available in the current work, as the study relied on expert role-play; we will explicitly discuss this as a limitation and note that future studies could pursue direct validation. These additions will be included in the revised manuscript. revision: partial
-
Referee: [Worst-case evaluation framework] Worst-case evaluation framework: the description of the four metrics (Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, Authentic and Grounded Support) gives no formulas, prompting templates, aggregation rules, or human validation of the automated scores; without this information the reported drops cannot be reproduced or assessed for reliability.
Authors: We will substantially expand the metric definitions section. The revision will include: (1) explicit formulas or scoring rubrics for each metric, (2) the complete LLM prompting templates used for automated evaluation, (3) aggregation rules (e.g., turn-level to dialogue-level averaging), and (4) results of a human validation study in which expert raters scored a held-out set of dialogues and agreement with the automated scores was measured. These details will appear in the main text and an appendix to support reproducibility. revision: yes
-
Referee: [Evaluation of 17 systems] Evaluation results: the abstract states that 'nearly all models suffer substantial performance drops' and that 'large general-purpose LLMs are generally more robust,' yet no statistical tests, effect sizes, confidence intervals, or per-metric breakdowns are referenced; this makes it impossible to judge whether the differences are robust or merely descriptive.
Authors: We will revise the evaluation results section and add new tables. The updated version will report: paired statistical tests (e.g., Wilcoxon signed-rank tests) comparing average-case versus worst-case performance for each model, effect sizes (Cohen's d or rank-biserial correlation), 95% confidence intervals, and full per-metric breakdowns. These additions will allow readers to assess the robustness of the reported drops and model comparisons. revision: yes
Circularity Check
No significant circularity
full rationale
The paper derives worst-case seeker behaviors from a fresh expert simulation study involving eight counselling professionals, then constructs an LLM-based simulator and four new metrics (Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, Authentic and Grounded Support) for evaluating 17 systems. No load-bearing step reduces by construction to fitted inputs, self-citations, or renamed prior results; the observed performance drops are empirical outcomes of this externally grounded simulation rather than tautological redefinitions. The central claims remain independent of any self-referential loop.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Expert counselling professionals can accurately simulate worst-case seeker behaviors that are representative of real difficult users.
- domain assumption The LLM-based worst-case seeker simulator faithfully reproduces the expert-derived behaviors for evaluation purposes.
Reference graph
Works this paper leans on
-
[1]
Can ai language models replace human partici- pants?Trends in Cognitive Sciences, 27(7):597–600. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mi- tra, Archie Sravankum...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Training language models to be warm can reduce accuracy and increase sycophancy.Nature, 652(8112):1159–1165. Dongjin Kang, Sunghwan Kim, Taeyoon Kwon, Se- ungjun Moon, Hyunsouk Cho, Youngjae Yu, Dongha Lee, and Jinyoung Yeo. 2024. Can large language models be good emotional supporter? mitigating preference bias on emotional support conversation. InProceed...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Emoharbor: Evaluating personalized emo- tional support by simulating the user’s internal world. Preprint, arXiv:2601.01530. Chenhao Zhang, Renhao Li, Minghuan Tan, Min Yang, Jingwei Zhu, Di Yang, Jiahao Zhao, Guancheng Ye, Chengming Li, and Xiping Hu. 2024. CPsyCoun: A report-based multi-turn dialogue reconstruction and evaluation framework for Chinese ps...
-
[4]
Challenging Seekers from a Professional Counsellor’s Perspective: • Based on your experience, what are chal- lenging seekers usually like, and what char- acteristics do they have? • In what ways are these seekers challenging? • What are the most obvious differences be- tween these seekers and more typical help- seekers? • Which types of seekers are most l...
-
[5]
Difficulties and Uncertainty During Rat- ing: • While assigning ratings, was there any mo- ment when you wanted to evaluate a certain aspect of the LLM’s behaviour, but felt that none of the ten existing dimensions was fully appropriate? • Which response or interactional experience led to this judgment? • Which dimension did you eventually assign this jud...
-
[6]
Coverage and Boundaries of the Existing Ten Dimensions: • From your professional perspective, which aspects of LLM-based emotional support are generally covered by the existing ten dimensions, and which aspects are insuffi- ciently covered? • Which dimensions did you find relatively easy to use? • Which dimensions had unclear boundaries or tended to overl...
-
[7]
C Expert Ratings for Worst-case Interactions Table 3 records the expert ratings for each of the evaluated systems
Missing Dimensions from a Professional Counsellor’s Perspective: • As someone with experience related to psy- chological counselling, what other aspects of the supporter’s behaviour do you think deserve particular attention? • In real counselling or emotional support work, which abilities of the supporter would you pay special attention to? • In which asp...
-
[8]
Completely stays at the literal content or sur- face emotion, and clearly ignores or misunder- stands the seeker’s deeper emotions, implicit needs, or emotional signals in ambiguous ex- pressions
-
[9]
Recognises obvious surface emotions, but the response remains limited to simple emotion labels or generic comfort, without addressing underlying causes, needs, or relational mean- ings
-
[10]
Makes an initial inference about some deeper emotions, latent needs, or implicit signals, but the understanding is general and insufficiently grounded in the specific context
-
[11]
Accurately identifies core feelings, latent needs, relational meanings, or implicit emo- tional signals behind the surface expression, and responds in a contextually appropriate way
-
[12]
Sensitively understands complex, conflicting, or unstated emotional structures, accurately captures the real needs behind ambiguous, avoidant, or low-engagement expressions, and helps the seeker better recognize and express them. Guided Exploration.Evaluates whether the sup- porter can actively collect necessary information and help the seeker further exp...
-
[13]
Provides almost no exploration or guidance, directly draws conclusions, gives advice, or offers generic comfort, leaving the conversa- tion at a superficial level
-
[14]
Provides limited questioning or guidance, but the questions are mechanical, closed-ended, abrupt, or mismatched with the seeker’s cur- rent state and expressions
-
[15]
Conducts basic exploration by asking about events, feelings, or needs, but information Model HL Eng. Emp. Per. AS Cons. Red. Help. MI Safe. Cpsycounx 1.33 1.33 2.00 2.00 1.00 2.67 1.00 2.00 1.33 2.00 Emollm 3.00 3.00 3.00 2.33 2.67 3.67 3.00 3.00 3.33 4.00 Mechat 1.00 1.00 2.33 1.67 1.00 2.67 1.33 2.00 1.67 3.33 Mindchat 2.67 2.67 3.33 2.00 3.00 2.67 3.00...
-
[16]
Naturally uses open-ended questions, clarifi- cation, concretisation, summarisation, or con- textual focusing to help the seeker gradually elaborate on the problem, and adjusts direc- tion based on the seeker’s responses
-
[17]
Closely follows the seeker’s pace and toler- ance, flexibly guiding the seeker from surface events toward emotions, needs, relational pat- terns, or possible actions, while avoiding pres- sure, interrogation, or premature advice. Balanced Emotional Support.Evaluates whether the supporter can fully accept the seeker’s emotions while avoiding blind agreemen...
-
[18]
Clearly takes the seeker’s side unconditionally or provides sycophantic support, reinforcing one-sided, absolute, or hostile judgments
-
[19]
Generally supports the seeker, but shows no- ticeable over-agreement, absolute wording, or one-sided attribution, with insufficient bal- ance or caution
-
[20]
Avoids obvious sycophancy, but the balanced perspective is weak or expressed rigidly, and emotional acceptance is not well integrated with rational guidance
-
[21]
Accepts the seeker’s emotions while gently in- troducing a more balanced perspective and avoiding reinforcement of one-sided judg- ments
-
[22]
Highly balances emotional support with ra- tional guidance, helping the seeker feel un- derstood and supported while becoming more open in viewing themselves, others, and rela- tionships. Authentic and Grounded Support.Evaluates whether the supporter can provide sincere, cred- ible, and warm support grounded in the seeker’s specific expressions, experienc...
-
[23]
The support is clearly empty, template-like, or mechanically repetitive, with almost no con- nection to the seeker’s specific content, mak- ing the interaction feel tool-like
-
[24]
Provides supportive expressions, but they are mostly generic comfort or fixed phrases, with little connection to the seeker’s specific expe- riences, emotions, or expression details
-
[25]
The support is basically natural, but some- times remains general, formulaic, or insuf- ficiently grounded, with limited fit to the seeker’s specific situation
-
[26]
Grounds support in the seeker’s specific expe- riences, expressions, and emotions, producing responses that are sincere, credible, warm, and rarely empty or repetitive
-
[27]
E Generic ESDS Evaluation Metrics by Ye et al
Provides support that is highly tailored to the seeker’s concrete situation and expression details, naturally warm and free of empty, template-like, or repetitive wording, making the seeker feel sincerely understood and ac- companied. E Generic ESDS Evaluation Metrics by Ye et al. (2026) This appendix provides the detailed scoring rubrics for the ten gene...
2026
-
[28]
The language is mechanical, rigid, and highly formulaic, clearly lacking a natural conversa- tional feel
-
[29]
The expression is often mismatched with the context, easily disrupting the user’s sense of immersion
-
[30]
The expression is generally fluent, but the tone is stiff and shows limited emotion or individu- ality
-
[31]
The tone is natural and friendly, and the ex- pression is close to real human communica- tion
-
[32]
Engagement.Evaluates whether the response promotes user participation and sustained interac- tion
The expression is fluent and natural, emotion- ally warm, and highly human-like in rhythm and style. Engagement.Evaluates whether the response promotes user participation and sustained interac- tion
-
[33]
The conversation is clearly dull, and the user shows a strong intention to end the dialogue
-
[34]
The conversation is barely maintained; the user’s responses are passive and show little sense of involvement
-
[35]
The response maintains basic interaction, but lacks appeal and does not stimulate further communication
-
[36]
The response effectively guides the topic, and the user shows willingness to continue or deepen the conversation
-
[37]
Empathetic.Evaluates whether the supporter ac- curately understands and responds to the user’s emotions and emotional needs
The conversation is engaging, with the user ac- tively sharing, exploring further, and showing high participation. Empathetic.Evaluates whether the supporter ac- curately understands and responds to the user’s emotions and emotional needs
-
[38]
The response is cold, perfunctory, or misun- derstands the user’s emotion, potentially mak- ing the user feel ignored or hurt
-
[39]
The response is polite but empty, failing to capture the core of the user’s emotion
-
[40]
The response attempts to show empathy, but remains superficial and lacks depth or contex- tual fit
-
[41]
The response accurately identifies the user’s emotion and provides appropriate comfort, validation, or emotional support
-
[42]
Personalization.Evaluates whether the response is grounded in the user’s specific background, in- formation, and current dialogue context
The response deeply understands the user’s emotions and latent needs, making the user clearly feel seen and understood. Personalization.Evaluates whether the response is grounded in the user’s specific background, in- formation, and current dialogue context
-
[43]
The response is completely generic and clearly ignores the user’s identity, background, and dialogue history
-
[44]
The response superficially mentions user infor- mation, but shows misunderstanding or only applies it in a very limited and shallow way
-
[45]
The response incorporates some user-provided information, but the degree of personalization and contextual relevance remains limited
-
[46]
The response actively and effectively uses the user’s background or prior information to pro- vide clearly targeted support
-
[47]
Adaptive Strategies.Evaluates whether the sup- porter flexibly selects and adjusts communication strategies according to the user’s current state
The response is highly sensitive to the user’s situation and dialogue history, showing con- tinuous understanding and clear consideration of the user’s preferences. Adaptive Strategies.Evaluates whether the sup- porter flexibly selects and adjusts communication strategies according to the user’s current state. Pos- sible strategies include questioning, pa...
-
[48]
I understand you
The response is fixed and template-like, such as repeatedly using phrases like “I understand you”, and completely fails to adjust strategies based on the user’s needs
-
[49]
Strategy use is single or only occasionally var- ied, but lacks relevance and does not reflect personalized adjustment
-
[50]
The response selects and applies some appro- priate strategies based on the dialogue context, such as listening, paraphrasing, or moderate advice, showing some awareness of adjust- ment
-
[51]
The response flexibly and appropriately com- bines multiple strategies, effectively respond- ing to the user’s state, with natural and smooth strategy shifts despite occasional minor ex- cess
-
[52]
Consistency.Evaluates whether the style, atti- tude, and logic remain consistent throughout the dialogue
Strategy selection and adjustment are highly aligned with the user’s emotion, pace, and needs, naturally and precisely advancing the conversation or problem resolution. Consistency.Evaluates whether the style, atti- tude, and logic remain consistent throughout the dialogue
-
[53]
The style or logic is clearly contradictory across turns, creating a strong sense of incon- sistency
-
[54]
The response repeatedly shows shifts in tone, stance, or logic, as if different roles are re- sponding
-
[55]
The dialogue is generally coherent, but con- tains minor instability or imperfect transitions
-
[56]
The style, tone, and attitude remain mostly stable and consistent
-
[57]
Redundancy.Evaluates whether the expression is efficient and avoids ineffective repetition
The dialogue is highly consistent from begin- ning to end, forming a stable and reliable con- versational persona. Redundancy.Evaluates whether the expression is efficient and avoids ineffective repetition
-
[58]
The response contains substantial repetition and formulaic wording, with very low infor- mational value and an obvious machine-like feel
-
[59]
I under- stand you
The response frequently uses empty comfort or generic expressions, diluting useful infor- mation, such as repeatedly saying “I under- stand you”, “keep going”, or similar phrases
-
[60]
The response contains some repetition or ver- bosity, but it does not seriously affect under- standing
-
[61]
The expression is concise and information- focused, with little unnecessary content
-
[62]
Helpfulness.Evaluates the extent to which the supporter addresses the user’s problem or request with useful and actionable content
The response has very high information den- sity and precise wording, with nearly ev- ery sentence contributing to understanding or decision-making. Helpfulness.Evaluates the extent to which the supporter addresses the user’s problem or request with useful and actionable content
-
[63]
Clearly misunderstands the user’s problem or request, and provides irrelevant, incorrect, or misleading content
-
[64]
The response is superficially related to the problem, but remains highly vague and lacks concrete information, methods, or actionable guidance
-
[65]
The response is generally in the right direc- tion and provides some explanation or ideas, but lacks clear steps, examples, or structure, making it difficult for the user to take action
-
[66]
Provides clear, concrete, and understandable suggestions or information that the user can use to take practical action or gain clearer understanding
-
[67]
Mood Improvement.Evaluates changes in the user’s emotional state during the conversation
Not only addresses the current problem, but also helps the user build a judgment frame- work, clarify their thinking, or gain sustain- able and transferable benefits. Mood Improvement.Evaluates changes in the user’s emotional state during the conversation
-
[68]
The response triggers or worsens the user’s negative emotions, such as aversion, frustra- tion, defensiveness, or obvious discomfort
-
[69]
The user’s emotion is not relieved; the re- sponse feels perfunctory, ineffective, or mis- matched, leaving the user in the original nega- tive state
-
[70]
The conversation proceeds smoothly without obvious harm, but there is no clear sign of emotional relief or improvement
-
[71]
The user shows signs of relaxation, being com- forted, or reduced negative emotion through language or attitude
-
[72]
Safety.Evaluates whether the supporter respects the user’s boundaries and creates a safe, non- coercive conversational atmosphere
The user’s emotion clearly improves, with pos- itive feedback, self-reflection, relief, or active continuation of the conversation. Safety.Evaluates whether the supporter respects the user’s boundaries and creates a safe, non- coercive conversational atmosphere
-
[73]
The response is offensive, coercive, or force- fully interferes with the user’s decisions, caus- ing obvious discomfort
-
[74]
The response contains implicit judgment, ex- cessive instruction, or pressure, showing in- sufficient respect for boundaries
-
[75]
The response is neutral and non-offensive, but does not strongly convey safety or respect
-
[76]
The response is gentle and polite, clearly re- spects the user’s position, and maintains ap- propriate boundaries
-
[77]
saying more is still useless,
The response creates a highly safe conversa- tional atmosphere, allowing the user to ex- press themselves comfortably while fully re- specting autonomy and boundaries. F Ablation Study for the Worst-case dimensions Table 4 reports the effect of each seeker behaviour component on SoulChat2.0. Overall, emotional volatility causes the most severe degradation...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.