
arxiv: 2605.22602 · v1 · pith:IAWODIVLnew · submitted 2026-05-21 · 💻 cs.AI

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

Pith reviewed 2026-05-22 05:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords Theory of Mind · Persuasive Dialogue · BDI Framework · Stepwise Reasoning · Mental State Inference · Knowledge Enhancement · Dialogue Agents

The pith

A dual knowledge-enhanced stepwise reasoning framework lets smaller models outperform GPT-5 at predicting desires, beliefs, and persuasive strategies by modeling their sequential dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new ToM-PD task that treats persuasive dialogue as a sequence of mental states drawn from the BDI framework. It releases the ToM-BPD dataset with fine-grained annotations for desires, beliefs, intentions, and the strategies used to influence them. It then introduces TTBYS, a reasoning method that draws on both explicit and implicit prior knowledge to reason in three explicit steps before generating a response. Experiments show that Qwen3-8B with this method surpasses GPT-5 on the three prediction subtasks. The work aims to give dialogue agents more stable and interpretable access to others' mental states.

Core claim

By grounding persuasive dialogue in the BDI framework and supplying both explicit and implicit prior experiences inside a three-step reasoning procedure, the TTBYS method produces more accurate and consistent inferences of desires, beliefs, and persuasive strategies than standard prompting or larger baseline models.

What carries the argument

TTBYS, a dual knowledge-enhanced stepwise reasoning framework that draws on both explicit and implicit prior experiences across three explicit reasoning steps, tracing the dependencies among mental states before selecting a persuasive strategy.
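As a rough illustration of what a stepwise, dependency-aware procedure of this kind might look like, the sketch below chains three model calls so that each later inference sees the earlier ones. It is not the authors' implementation. The figure excerpts name only a first step ("First Think: Desire Inference"); the belief-then-strategy ordering, the prompt wording, and the names `think_thrice`, `MentalState`, and `llm` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a model answer.
Predictor = Callable[[str], str]


@dataclass
class MentalState:
    desire: str
    belief: str
    strategy: str


def think_thrice(dialogue: str, experiences: List[str], llm: Predictor) -> MentalState:
    """Chain three inferences so that each step conditions on the previous ones.

    Assumed ordering (not confirmed by the source): desire, then belief, then strategy.
    """
    context = "\n".join(experiences)
    base = f"Relevant experiences:\n{context}\n\nDialogue:\n{dialogue}\n"

    # Step 1: infer the persuadee's desire from the dialogue and retrieved experiences.
    desire = llm(base + "Question: what is the persuadee's desire?")

    # Step 2: infer belief, conditioned on the desire inferred in step 1.
    belief = llm(base + f"Inferred desire: {desire}\nQuestion: what is the persuadee's belief?")

    # Step 3: choose a strategy, conditioned on both earlier inferences.
    strategy = llm(
        base
        + f"Inferred desire: {desire}\nInferred belief: {belief}\n"
        + "Question: which persuasive strategy should be used next?"
    )
    return MentalState(desire=desire, belief=belief, strategy=strategy)
```

Passing the earlier outputs forward in the prompt is the simplest way to make the dependency between steps explicit and inspectable; the actual framework may integrate prior knowledge differently.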

If this is right

  • Persuasive agents built on TTBYS will maintain consistent mental-state tracking across multiple turns instead of producing fragmented inferences.
  • Smaller open models can reach or exceed closed large-model performance on desire, belief, and strategy prediction when the three-step dual-knowledge procedure is applied.
  • The explicit stepwise trace improves the ability to inspect and debug why an agent chooses one persuasive move over another.
  • The same structure can be reused for other multi-turn social tasks that require tracking latent mental states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the three-step structure generalizes, it could be inserted into existing ToM benchmarks outside persuasion to test whether the dual-knowledge pattern improves performance on non-persuasive social reasoning.
  • The method may lower the compute cost of building socially capable agents by allowing mid-sized models to substitute for much larger ones in dialogue settings.
  • Future work could measure whether the interpretability gains translate into better user outcomes in actual persuasion scenarios such as sales or health coaching.

Load-bearing premise

The BDI framework and the ToM-BPD annotations correctly reflect the actual order and dependencies among mental states that occur in real persuasive conversations.

What would settle it

A new set of human-annotated dialogues in which the order of desire, belief, and intention labels deviates significantly from the BDI sequence assumed by the dataset, or in which independent raters agree with the original mental-state labels no better than chance.
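One standard way to quantify agreement relative to chance is Cohen's kappa, computed between the original labels and an independent rater's labels. The sketch below is a minimal plain-Python version; the function name and the example label values are illustrative assumptions, and the paper itself is not reported to use this statistic.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items on which the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each rater labelled independently with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 0 means agreement is about what chance would produce, which is the outcome that would undercut the annotations; values near 1 indicate strong agreement.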

Figures

Figures reproduced from arXiv: 2605.22602 by Bin Guo, Jingqi Liu, Mengqi Chen, Minghui Ma, Qiuyun Zhang, Runze Yang, Xuehao Ma, Yahan Pei, Yan Liu, Zhiwen Yu.

Figure 1: Illustration of self BDI state evolution and BDI-based inference for ToM-driven persuasive dialogue (ToM-PD). The left panel shows the internal …

Figure 2: Overall analysis of dialogue strategies, desire, and belief dynamics

Figure 3: Overview of the ToM-PD (left) and the TTBYS framework (right). In the ToM-PD task, the persuader sequentially infers the persuadee’s mental states, including intention, desire, and belief, from the dialogue history, and subsequently selects an appropriate persuasive strategy based on the inferred states. TTBYS operationalizes this process through three explicit reasoning steps, each corresponding to one st…

Figure 4: An example of a ToM-PD Experience.

Adjacent text captured with Figure 4: B. First Think: Desire Inference. To leverage ToM experiences for deliberative judgment, we first summarize the current dialogue turn (u_t, a_t) as a dialogue summary i_t. This summary captures the key observable behaviors and the inferred intention of the persuadee at this stage. Using i_t as a query, we retrieve the top-N most semantically similar historical experiences fro…

Figure 5: Impact of the blending coefficients on prediction accuracy. Left: …

Figure 6: Impact of experience quantity on desire and strategy prediction

Figure 7: Relative performance gains of Qwen-3-8B+ours over baseline methods
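The text captured with Figure 4 describes using the dialogue summary as a query to retrieve the top-N most semantically similar historical experiences. A minimal sketch of such a retrieval step follows. It assumes the summary and the experiences have already been embedded as numeric vectors and that similarity is cosine similarity; the excerpt does not state the embedding model or the similarity measure, and the names `cosine` and `top_n_experiences` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_experiences(
    query_vec: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
    n: int,
) -> List[str]:
    """Return the texts of the n stored experiences most similar to the query.

    `bank` holds (experience_text, embedding) pairs; embeddings are assumed precomputed.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:n]]
```

For example, with a bank of three experiences embedded as `[1, 0]`, `[0, 1]`, and `[1, 1]`, a query of `[1, 0.1]` ranks the first experience highest and the third second.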
Original abstract

Persuasive dialogue requires reasoning about others' latent mental states, a capability known as Theory of Mind (ToM). However, due to reliance on simple prompting strategies and insufficient ToM knowledge, existing LLMs often fail to capture the intrinsic dependencies among mental states, leading to fragmented representations and unstable reasoning. To address these challenges, we introduce the ToM-based Persuasive Dialogue (ToM-PD) task, grounded in the Belief-Desire-Intention (BDI) framework, which explicitly models the sequential dependencies among mental states in multi-turn dialogues. To facilitate research on this task, we construct a large-scale annotated dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), capturing fine-grained mental states and corresponding persuasive strategies. We further propose Think Thrice Before You Speak (TTBYS), a knowledge-enhanced stepwise reasoning framework that leverages both explicit and implicit prior experiences to improve LLMs' inference of desires, beliefs, and persuasive strategies. Experimental results demonstrate that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% in predicting desires, beliefs, and persuasive strategies, respectively. Case studies further show that our approach enhances interpretability and consistency in reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the ToM-PD task grounded in the BDI framework to explicitly model sequential dependencies among beliefs, desires, and intentions in multi-turn persuasive dialogues. It constructs the ToM-BPD dataset with fine-grained annotations of mental states and persuasive strategies, and proposes the TTBYS framework that performs stepwise reasoning augmented by explicit and implicit prior knowledge. Experiments report that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% on desire, belief, and strategy prediction tasks, respectively, with additional case studies on interpretability.

Significance. If the dataset annotations prove reliable and the performance gains are shown to be robust, the work could meaningfully advance structured Theory-of-Mind modeling in LLMs for dialogue agents. The BDI-grounded task definition and dual-knowledge enhancement mechanism offer a concrete alternative to ad-hoc prompting, and the new dataset may become a useful resource for evaluating sequential mental-state reasoning.

major comments (2)
  1. [§3 (ToM-BPD Dataset Construction)] No inter-annotator agreement scores, annotation guideline details, or external validation against independent persuasive dialogue corpora are reported. Because the headline performance deltas rest on the claim that these annotations faithfully encode the intrinsic sequential dependencies among mental states, the absence of such checks leaves open the possibility that measured gains reflect annotation artifacts rather than improved reasoning.
  2. [§4 (Experimental Results)] The reported improvements lack accompanying details on prompting templates for GPT-5, exact evaluation metrics, statistical significance tests, variance across runs, or error analysis. The modest 1.20% desire-prediction gain in particular requires these elements to establish that the result is not within noise or tied to a specific test-split distribution.
minor comments (2)
  1. [§2 (TTBYS Framework)] The distinction between 'explicit' and 'implicit' prior experiences in the TTBYS description could be illustrated with a concrete example from the dataset to improve clarity.
  2. [Figures in §4] Figure captions and axis labels should explicitly state the evaluation metric (e.g., accuracy or F1) used for the reported percentages.
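To make the request for significance testing concrete, one common option is a paired bootstrap over test items, which yields a confidence interval for the accuracy difference between two systems evaluated on the same examples. The sketch below is one such check under stated assumptions: per-item correctness is available as 0/1 values, and the function name, resample count, and seed are illustrative choices rather than anything specified in the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for accuracy(A) - accuracy(B) on paired items.

    correct_a[i] and correct_b[i] are 1 if the system answered item i correctly, else 0.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct_a)

    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)

    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval for a small gain such as the reported 1.20% includes zero, the difference cannot be distinguished from resampling noise on that test split.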

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of dataset reliability and experimental rigor. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: §3 (ToM-BPD Dataset Construction): No inter-annotator agreement scores, annotation guideline details, or external validation against independent persuasive dialogue corpora are reported. Because the headline performance deltas rest on the claim that these annotations faithfully encode the intrinsic sequential dependencies among mental states, the absence of such checks leaves open the possibility that measured gains reflect annotation artifacts rather than improved reasoning.

    Authors: We agree that reporting inter-annotator agreement and annotation details is essential for validating the ToM-BPD dataset. In the revised manuscript we will add inter-annotator agreement scores, expanded annotation guideline excerpts, and an explicit discussion of the lack of external validation against other corpora (noting that the BDI-grounded scheme is task-specific). These additions will directly address concerns about potential annotation artifacts and better substantiate the sequential mental-state dependencies. revision: yes

  2. Referee: §4 (Experimental Results): The reported improvements lack accompanying details on prompting templates for GPT-5, exact evaluation metrics, statistical significance tests, variance across runs, or error analysis. The modest 1.20% desire-prediction gain in particular requires these elements to establish that the result is not within noise or tied to a specific test-split distribution.

    Authors: We acknowledge that the experimental reporting requires greater detail to establish robustness. We will revise Section 4 to include the full prompting templates for GPT-5 and all baselines, precise metric definitions, statistical significance tests, variance across multiple runs, and an expanded error analysis that examines the sources of the 1.20% desire-prediction improvement. These changes will demonstrate that the gains are not attributable to noise or split-specific effects. revision: yes

Circularity Check

0 steps flagged

Empirical performance gains on new ToM-BPD dataset do not reduce to self-defined quantities or fitted predictions

full rationale

The paper defines the ToM-PD task via the BDI framework, constructs the ToM-BPD dataset with mental-state annotations, introduces the TTBYS stepwise reasoning method, and reports direct empirical deltas (Qwen3-8B + TTBYS vs. GPT-5) on desire/belief/strategy prediction. These measured improvements are standard held-out comparisons and do not arise from any paper-internal equation that equates a claimed prediction to a fitted parameter or prior output by construction. No load-bearing self-citation chain, uniqueness theorem, or ansatz smuggling is invoked for the core claims; the BDI grounding and dataset construction function as definitional inputs rather than derived results. This yields only a minor score reflecting the absence of external validation for annotations, not circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work rests on the standard BDI framework as a modeling choice and introduces two new constructed artifacts without external falsifiable evidence.

axioms (1)
  • domain assumption: The Belief-Desire-Intention (BDI) framework accurately captures sequential dependencies among mental states in multi-turn persuasive dialogues.
    Task definition and dataset annotation are explicitly grounded in BDI per the abstract.
invented entities (2)
  • ToM-PD task (no independent evidence)
    purpose: Formalize persuasive dialogue under Theory of Mind with explicit mental-state dependencies
    Newly introduced task definition.
  • TTBYS framework (no independent evidence)
    purpose: Stepwise dual-knowledge reasoning for desire, belief, and strategy inference
    Newly proposed method.

pith-pipeline@v0.9.0 · 5793 in / 1460 out tokens · 47548 ms · 2026-05-22T05:45:13.473812+00:00 · methodology




Appendix

This section presents the prompts used in our experiments. Section A describes the prompts used for automatic annotation, Section B presents the promp...

Prompt for Vanilla Zero-shot Prompting: The vanilla zero-shot prompts for predicting desire, belief, and strategy are presented as follows.

    Prompt for Vanilla Prompting (Desire Prediction)
    Current conversation: <dialogue history>
    Based on the above conversation, classify the persuadee’s desire. Choose exactly one option: A. Unwilling B. Uncertain C. Willing...

Prompt for CoT prompting: The CoT prompts for predicting desire, belief, and strategy are presented as follows.

    Prompt for CoT Prompting (Desire Prediction)
    Current conversation: <dialogue history>
    Based on the above conversation, classify the persuadee’s desire. Think step by step to answer the question. End your response with: ...

Prompt for TTBYS: TTBYS uses vanilla zero-shot prompting to predict desire and strategy. The prompt for predicting belief is as follows.

    Prompt for TTBYS (Belief Prediction)
    Relevant Experience: <top relevant experience>
    Infer the persuadee’s belief in the current conversation context based on the prediction method in relevant experiences.
    Current conversat...

Prompt for evaluation: We utilize a large language model as an evaluator to assess the belief prediction accuracy of TTBYS, using the prompt as follows.

    Prompt for Belief Evaluation
    You are an evaluator. Your task is to evaluate the accuracy of belief prediction based on the following rules:
      • If the predicted positive and negative beliefs fully match the ground truth, score = 1.
      • If both positive and negative beliefs are mentioned but the underlying reasons are not fully correct, score = 0.5.
      • If both are incorrect, score = 0.
      • Otherwise, score = 0.
    If the ground truth belief only contains a positive OR only a negative belief:
      • If the prediction matches, score = 0.5.
      • Otherwise, score = 0.
    Ground truth belief: <gt_belief>
    Predicted belief: <pred_belief>
    Output ONLY a number in {0, 0.5, 1}.

C. Prompt for Interactive Evaluation

The prompts used for the interactive experiments, including GPT-5, GPT-5 +...
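The belief-evaluation prompt instructs the evaluator model to output only a number in {0, 0.5, 1}. A small sketch of how such outputs might be validated and combined into a single belief score is given below. The excerpt does not say how per-item scores are aggregated, so averaging is an assumption here, and the function names `parse_score` and `mean_belief_score` are hypothetical.

```python
from typing import Iterable

# The only values the evaluation prompt allows the evaluator to output.
ALLOWED_SCORES = (0.0, 0.5, 1.0)


def parse_score(raw: str) -> float:
    """Convert one raw evaluator output to a score, rejecting anything off-rubric."""
    value = float(raw.strip())  # raises ValueError if the output is not numeric
    if value not in ALLOWED_SCORES:
        raise ValueError(f"score {value!r} is not one of {ALLOWED_SCORES}")
    return value


def mean_belief_score(raw_outputs: Iterable[str]) -> float:
    """Average the per-item scores (assumed aggregation; not stated in the source)."""
    scores = [parse_score(raw) for raw in raw_outputs]
    if not scores:
        raise ValueError("no evaluator outputs to aggregate")
    return sum(scores) / len(scores)
```

Rejecting off-rubric outputs rather than rounding them keeps evaluator failures visible instead of silently folding them into the reported accuracy.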

... comprehensive facilities, affordable pricing, and encouraging long-term exercise habits ...

We observed that these experiences closely resemble the current context, especially the top-3 experiences in Case 1, which are highly similar to the statements in Case 1. The concise belief prediction patterns in Case 2 also guided the LLM to produce beliefs more aligned with the ground truth. Cas...