Discourse Diversity in Multi-Turn Empathic Dialogue
Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3
The pith
Large language models reuse the same discourse tactics in consecutive turns of empathic dialogue at nearly twice the rate of human supporters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures.
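To make the reuse statistic concrete, here is a minimal sketch of one way such a rate could be computed. It assumes each supporter turn has already been labeled with a set of tactics, and that "reuse" means a tactic from one supporter turn appears again in the next. The function name, the tactic labels, and this exact definition are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence, Set


def cross_turn_reuse_rate(conversations: Sequence[Sequence[Set[str]]]) -> float:
    """Fraction of tactics in a supporter turn that recur in the next supporter turn.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    reused = 0
    total = 0
    for turns in conversations:
        # Walk over consecutive supporter-turn pairs (t, t + 1).
        for current, following in zip(turns, turns[1:]):
            total += len(current)
            reused += len(current & following)
    return reused / total if total else 0.0


# Hypothetical tactic labels: one conversation, one set per supporter turn.
example = [
    [{"validation", "question"}, {"validation", "advice"}, {"advice"}],
]
rate = cross_turn_reuse_rate(example)
print(f"cross-turn reuse rate: {rate:.2f}")
```

For scale, the reported rates put the LLM-to-human ratio between roughly 1.85 (0.50/0.27) and 2.07 (0.56/0.27).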
What carries the argument
MINT (Multi-turn Inter-tactic Novelty Training), a reinforcement learning framework that jointly optimizes an empathy quality reward and a cross-turn tactic novelty signal to reduce repetition of discourse moves.
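The joint objective can be pictured with a small sketch. The additive form, the `novelty_weight` parameter, and the choice to compare only against the immediately preceding supporter turn are all assumptions made for illustration; the summary above does not specify how MINT actually combines the two signals.

```python
from typing import List, Set


def tactic_novelty(current: Set[str], previous_turns: List[Set[str]]) -> float:
    """Share of the current turn's tactics that were not used in the previous supporter turn."""
    if not current:
        return 0.0
    if not previous_turns:
        return 1.0  # First supporter turn: nothing to repeat yet.
    repeated = current & previous_turns[-1]
    return 1.0 - len(repeated) / len(current)


def joint_reward(
    empathy_score: float,
    current: Set[str],
    previous_turns: List[Set[str]],
    novelty_weight: float = 0.5,  # Hypothetical weight, not taken from the paper.
) -> float:
    """Empathy quality plus a weighted cross-turn novelty bonus (assumed additive form)."""
    return empathy_score + novelty_weight * tactic_novelty(current, previous_turns)


history = [{"validation"}]
repetitive = joint_reward(0.8, {"validation"}, history)
varied = joint_reward(0.8, {"question", "advice"}, history)
print(repetitive, varied)
```

With equal empathy scores, the turn that switches tactics receives the higher reward, which is the behavior a novelty term is meant to encourage.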
If this is right
- LLMs can deliver more varied empathic support across multiple turns when trained with explicit novelty incentives.
- Standard similarity metrics fail to detect discourse-level repetition that affects conversation quality.
- Joint quality and diversity rewards outperform either signal used alone.
- Both smaller and larger models show gains, indicating the approach is not limited to scale.
- Effective multi-turn empathy requires planning for strategy variety rather than optimizing each response in isolation.
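The point about similarity metrics can be illustrated with a toy case: two supporter turns that share almost no words yet perform the same discourse move. Token-level Jaccard overlap stands in here for a "standard similarity metric", and the sentences and tactic labels are invented for the example; neither comes from the paper.

```python
import re
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Two hypothetical consecutive supporter turns with hand-assigned tactic labels.
turn_1 = "That sounds really hard, and it makes sense you feel drained."
turn_2 = "Of course you are upset; anyone would be after a week like this."
tactics_1 = {"validation"}
tactics_2 = {"validation"}

lexical = jaccard(tokens(turn_1), tokens(turn_2))
tactic = jaccard(tactics_1, tactics_2)
print(f"lexical overlap: {lexical:.3f}, tactic overlap: {tactic:.1f}")
```

The lexical score stays near zero while the tactic overlap is complete, so a surface-level metric would register these turns as diverse even though the supporter repeats the same move.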
Where Pith is reading between the lines
- Similar novelty objectives might apply to other long-horizon dialogue tasks such as tutoring or customer service where repetition reduces engagement.
- The results suggest that empathy training corpora alone are insufficient without added diversity constraints.
- Future systems could add explicit multi-turn planning modules to select tactic sequences ahead of time.
- Longer-term user studies would be needed to confirm whether reduced repetition translates to better real-world outcomes.
Load-bearing premise
Discourse tactics can be identified reliably enough by classifiers to serve as training signals, and rewarding their cross-turn novelty will produce more effective support without harming coherence or other unmeasured qualities.
What would settle it
Human raters find no gain in perceived support quality or engagement for MINT dialogues despite lower measured repetition and higher automated empathy scores, or the tactic classifier shows low agreement with human annotations of strategy use.
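Agreement between a tactic classifier and human annotators is commonly summarized with Cohen's kappa. The sketch below computes it from scratch for two label sequences; the binary labels (1 = tactic present) are made-up illustrative data, not results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical per-sentence judgments for one tactic.
human = [1, 1, 0, 0, 1, 0, 1, 0]
classifier = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohens_kappa(human, classifier)
print(f"kappa = {kappa:.2f}")
```

A low kappa on a human-annotated sample would be the kind of evidence that undermines both the reuse statistic and the novelty reward, since both depend on the same labels.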
Original abstract
Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit substantially higher cross-turn reuse of discourse tactics in multi-turn empathic dialogues than humans (0.50-0.56 vs. 0.27), a pattern invisible to standard similarity metrics. It introduces MINT, a reinforcement learning framework that augments an empathy quality reward with a cross-turn tactic novelty signal; the best variant yields a 25.3% gain in aggregate empathy over vanilla LLMs (across 1.7B and 4B models) and a 26.3% reduction in repetition on the 4B model, outperforming quality-only and token-level diversity baselines.
Significance. If the measurement of tactics is reliable, the work identifies a concrete, previously under-examined limitation of LLMs in sustained empathic support and supplies a practical RL recipe that demonstrably increases discourse-move variety while preserving or improving rated empathy. The explicit separation of an external empathy signal from a sequence-level novelty term is a clear methodological strength and supplies falsifiable predictions about multi-turn behavior.
major comments (3)
- [§3] Tactic Taxonomy and Annotation: the headline reuse statistic (0.50-0.56 vs. 0.27) and the MINT novelty reward both presuppose consistent, unbiased labeling of discourse tactics across human and model dialogues. The manuscript reports neither inter-annotator agreement, human validation of the taxonomy, nor an ablation on label noise, leaving both the empirical claim and the 25.3% empathy gain dependent on an unverified measurement step.
- [§5] Results and Evaluation: the reported 25.3% empathy improvement and 26.3% repetition reduction are presented as aggregate figures without accompanying statistical tests, confidence intervals, or controls for conversation length, topic distribution, or model scale. It is therefore impossible to assess whether the gains are robust or could be explained by confounds.
- [§4.2] MINT Reward Formulation: the cross-turn novelty term is defined on the same tactic sequences used for the reuse analysis. If tactic labels are produced by an LLM classifier or heuristic that was not independently validated, the RL objective may be optimizing for an artifact of the labeler rather than genuine discourse diversity; no ablation isolating this risk is provided.
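The request for confidence intervals in the §5 comment could be met in several ways; one common option is a paired bootstrap over per-dialogue score differences. The sketch below is a generic percentile bootstrap, with invented scores and a hypothetical function name. It is not the analysis the authors ran.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference with a percentile bootstrap confidence interval."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower_idx = max(0, round((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return mean(diffs), means[lower_idx], means[upper_idx]


# Hypothetical per-dialogue empathy scores for the same prompts under two models.
vanilla_scores = [0.50, 0.60, 0.40, 0.55, 0.45]
mint_scores = [0.70, 0.75, 0.60, 0.70, 0.65]
point, low, high = paired_bootstrap_ci(vanilla_scores, mint_scores)
print(f"mean gain {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support the claim that the gain is not a sampling artifact, though it would not by itself address confounds such as conversation length or topic.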
minor comments (2)
- [§1] The abstract and §1 cite prior work on single-turn repetition (Gueorguieva et al., 2026) but do not clarify how the multi-turn extension differs methodologically from that baseline.
- [§4] Notation for the novelty reward (e.g., the precise definition of “cross-turn tactic novelty”) is introduced in §4 but is not restated in the results tables, making it difficult to map numbers back to the objective.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the manuscript accordingly to improve clarity, rigor, and completeness.
Point-by-point responses
- Referee: [§3] Tactic Taxonomy and Annotation: the headline reuse statistic (0.50-0.56 vs. 0.27) and the MINT novelty reward both presuppose consistent, unbiased labeling of discourse tactics across human and model dialogues. The manuscript reports neither inter-annotator agreement, human validation of the taxonomy, nor an ablation on label noise, leaving both the empirical claim and the 25.3% empathy gain dependent on an unverified measurement step.
  Authors: We agree that explicit reporting of inter-annotator agreement and validation is necessary to support the reliability of the tactic annotations. The taxonomy draws from established frameworks in counseling psychology, and annotations followed detailed guidelines applied by multiple trained annotators. In the revised manuscript we now include Cohen's kappa scores demonstrating substantial agreement, along with a human validation study on a held-out set of dialogues. We have also added an ablation introducing controlled label noise to show that the core reuse statistics and MINT gains remain stable.
  Revision: yes
- Referee: [§5] Results and Evaluation: the reported 25.3% empathy improvement and 26.3% repetition reduction are presented as aggregate figures without accompanying statistical tests, confidence intervals, or controls for conversation length, topic distribution, or model scale. It is therefore impossible to assess whether the gains are robust or could be explained by confounds.
  Authors: We acknowledge that the original presentation lacked formal statistical support and confound controls. The revised manuscript now reports paired statistical tests with p-values, bootstrap-derived confidence intervals for the empathy and repetition metrics, and results stratified by conversation length, topic, and model scale. These additions confirm that the reported improvements are statistically significant and consistent across conditions.
  Revision: yes
- Referee: [§4.2] MINT Reward Formulation: the cross-turn novelty term is defined on the same tactic sequences used for the reuse analysis. If tactic labels are produced by an LLM classifier or heuristic that was not independently validated, the RL objective may be optimizing for an artifact of the labeler rather than genuine discourse diversity; no ablation isolating this risk is provided.
  Authors: We recognize the potential for circularity if the novelty signal were the sole driver. However, the empathy quality reward is computed from an independent scorer trained on human empathy ratings that does not use tactic labels. The joint improvement in both empathy and reduced repetition under MINT supports that the signal captures genuine discourse variation. To isolate the risk, the revision includes an ablation replacing the tactic-based novelty term with an embedding-similarity diversity reward; MINT continues to outperform the baselines on both metrics.
  Revision: yes
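A label-noise ablation of the kind mentioned in the first response can be sketched as follows: corrupt a fraction of tactic labels at random and check how much the reuse statistic moves. The tactic inventory, the one-label-per-turn simplification, and the noise rates are assumptions for illustration only and do not reproduce the authors' procedure.

```python
import random
from typing import List

# Hypothetical tactic inventory; the paper's taxonomy is not reproduced here.
TACTICS = ["validation", "question", "advice", "reassurance"]


def reuse_rate(turn_labels: List[str]) -> float:
    """Share of consecutive supporter-turn pairs that repeat the same tactic."""
    pairs = list(zip(turn_labels, turn_labels[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def add_label_noise(turn_labels: List[str], noise_rate: float, rng: random.Random) -> List[str]:
    """Replace each label with a different random tactic with probability noise_rate."""
    noisy = []
    for label in turn_labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([t for t in TACTICS if t != label]))
        else:
            noisy.append(label)
    return noisy


rng = random.Random(0)
# Simplification: one tactic label per supporter turn.
clean = ["validation", "validation", "question", "advice", "advice", "advice"]
for noise_rate in (0.0, 0.1, 0.3):
    noisy = add_label_noise(clean, noise_rate, rng)
    print(f"noise={noise_rate:.1f} reuse={reuse_rate(noisy):.2f}")
```

If the gap between human and model reuse rates survives realistic noise levels, the headline comparison is less likely to be an artifact of the labeler.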
Circularity Check
No significant circularity; empirical measurements and RL results are independent of inputs
The paper reports new multi-turn empirical statistics on tactic reuse (0.50-0.56 vs. 0.27) and presents MINT as an RL method whose rewards combine an external empathy signal with a novelty term over identified tactics. No equations or derivations reduce the reported gains (25.3% empathy, 26.3% repetition reduction) to parameters fitted on the evaluation data or to the single-turn prior result by construction. The cited Gueorguieva et al. 2026 work is used only for motivation and single-turn background; the central multi-turn findings and training outcomes rest on separate measurements and comparisons to baselines.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Discourse moves in empathic responses can be consistently categorized into discrete tactics whose repetition can be measured across turns.
- Domain assumption: Reinforcement learning with an added novelty term will increase tactic diversity without degrading other unmeasured aspects of conversation quality.
invented entities (1)
- MINT (Multi-turn Inter-tactic Novelty Training): no independent evidence