How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues
Pith reviewed 2026-05-14 22:22 UTC · model grok-4.3
The pith
Guilt induction is the only persuasion strategy significantly linked to lower donation rates in these dialogues, associated with roughly a 23-percentage-point drop in compliance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Labeling all 10,600 persuader turns in the PersuasionForGood corpus with a 41-item strategy taxonomy via three LLMs reveals that strategy categories alone account for little variance in donation outcome. Guilt Induction is the sole strategy significantly tied to lower donation rates, with an effect size of roughly 23 percentage points that replicates across annotators despite only moderate agreement between them. Reciprocity emerges as the most consistent positive correlate, while target sentiment predicts donation occurrence but shows only weak links to donation amount.
What carries the argument
LLM annotation of 41 persuasion strategies across all dialogue turns, followed by logistic regressions testing associations between strategy presence and observed donation outcomes, corrected for multiple comparisons.
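For orientation, the shape of such a test is easy to sketch. The following is a minimal illustration, not the authors' pipeline: the file name, the column layout (one row per dialogue, 0/1 category indicators, a 0/1 donated flag), and the choice of α are all assumptions.

```python
# Illustrative sketch: logistic regression of donation outcome on
# binary strategy-category indicators, with Bonferroni correction.
# File name and column layout are hypothetical, not the paper's code.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("dialogues_with_category_flags.csv")  # hypothetical file
categories = [c for c in df.columns if c.startswith("cat_")]  # 11 indicators

X = sm.add_constant(df[categories].astype(float))
model = sm.Logit(df["donated"].astype(float), X).fit(disp=0)

# statsmodels reports McFadden's pseudo-R^2 directly.
print("pseudo-R^2 (McFadden):", round(model.prsquared, 4))

# Bonferroni: a category counts as significant only if p <= alpha / m.
alpha, m = 0.05, len(categories)
for cat in categories:
    p = model.pvalues[cat]
    verdict = "significant" if p <= alpha / m else "n.s."
    print(f"{cat}: coef={model.params[cat]:+.3f}, p={p:.4f} ({verdict})")
```

A pseudo-R² near 0.015 from a fit of this shape is what the paper summarizes as strategy categories explaining little variance.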
If this is right
- Most persuasion strategies have negligible measurable impact on whether people donate.
- Guilt-based appeals are associated with fewer donations and may be counterproductive.
- Reciprocity-based approaches show the strongest positive link to donations.
- Recipient sentiment and interest are better predictors of donation occurrence than any strategy label.
- Identifying strategies alone cannot account for most variation in persuasion effectiveness.
Where Pith is reading between the lines
- Persuaders in charity contexts may benefit from avoiding guilt tactics entirely.
- Dialogue features such as timing, sequence, or emotional tone could matter more than discrete strategy labels.
- The same annotation approach could be applied to other prosocial behaviors like volunteering or policy support.
Load-bearing premise
The strategy labels produced by the LLMs are accurate and unbiased enough to support causal-sounding conclusions about real donation behavior.
What would settle it
A human re-annotation of a random subset of the dialogues, testing whether the same significant associations survive, particularly the one for guilt induction.
read the original abstract
Which persuasion strategies, if any, are associated with donation compliance? Answering this requires fine-grained strategy labels across a full corpus and statistical tests corrected for multiple comparisons. We annotate all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus (Wang et al., 2019), where donation outcomes are directly observable, with a taxonomy of 41 strategies in 11 categories, using three open-source large language models (LLMs; Qwen3:30b, Mistral-Small-3.2, Phi-4). Strategy categories alone explain little variance in donation outcome (pseudo $R^2 \approx 0.015$, consistent across all three annotators). Guilt Induction is the only strategy significantly associated with lower donation rates ($\Delta \approx -23$ percentage points), an effect that replicates across all three models despite only moderate inter-model agreement. Reciprocity is the most robust positive correlate. Target sentiment and interest predict whether a donation occurs but show at most a weak correlation with donation amount. These findings suggest that strategy identification alone is insufficient to explain persuasion effectiveness, and that guilt-based appeals may be counterproductive in prosocial settings. We release the fully annotated corpus as a public resource.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript annotates all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus using three open-source LLMs (Qwen3:30b, Mistral-Small-3.2, Phi-4) to label a taxonomy of 41 strategies in 11 categories. Logistic regressions on the resulting binary indicators show that strategy categories explain little variance in donation outcome (pseudo-R² ≈ 0.015 across annotators). Guilt Induction is the only strategy significantly associated with lower donation rates (Δ ≈ -23 percentage points), replicating across models despite moderate inter-model agreement; Reciprocity is the most robust positive correlate. Target sentiment and interest predict donation occurrence but correlate weakly with amount. The authors conclude that strategy identification alone is insufficient to explain persuasion effectiveness and release the annotated corpus publicly.
Significance. If the LLM labels are shown to be sufficiently reliable, the study supplies large-scale empirical evidence on which persuasion strategies are associated with observable donation compliance in real dialogues. The low pseudo-R² and the replicated negative association for Guilt Induction are informative for computational persuasion research, while the public release of the fully labeled corpus constitutes a concrete resource for follow-up work.
major comments (2)
- [Methods (Annotation Procedure)] The moderate inter-model agreement on the 41-strategy labels leaves the headline Guilt Induction coefficient (Δ ≈ -23 pp) vulnerable to annotation noise. A sensitivity analysis or human-validated subset for the guilt category is needed to demonstrate that the reported association is not inflated by systematic mislabeling of guilt-adjacent turns.
- [Results (Statistical Tests)] The statistical analysis section does not specify the multiple-comparison correction procedure, the exact logistic regression specification (including controls for dialogue length, speaker effects, or turn position), or how pseudo-R² was computed. These omissions prevent assessment of whether the significance claims for Guilt Induction and Reciprocity are robust to plausible confounds.
minor comments (1)
- [Abstract] The abstract states that the effect 'replicates across all three models' but does not report the inter-model agreement metric (e.g., Fleiss' kappa or pairwise F1); adding this value would clarify the strength of the replication claim.
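For reference, the pairwise agreement figure the referee asks for is cheap to compute; below is a minimal sketch using pairwise Cohen's κ over three aligned per-turn label vectors (the toy labels and annotator keys are stand-ins, not the paper's data).

```python
# Minimal sketch: pairwise Cohen's kappa between three annotators'
# per-turn strategy labels. The four-turn vectors are illustrative.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "qwen3":   ["guilt", "reciprocity", "rational", "guilt"],
    "mistral": ["guilt", "rational",    "rational", "guilt"],
    "phi4":    ["moral", "reciprocity", "rational", "guilt"],
}

for a, b in combinations(labels, 2):
    print(f"kappa({a}, {b}) = {cohen_kappa_score(labels[a], labels[b]):.2f}")
```

Fleiss' κ generalizes this to all three annotators at once; the paper's Limitations section quotes κ = 0.38–0.54, squarely in the conventional "moderate" band.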
Simulated Author's Rebuttal
We are grateful to the referee for their thoughtful review and constructive suggestions. We believe the revisions outlined below will strengthen the manuscript and address the concerns raised. We respond to each major comment in turn.
read point-by-point responses
Referee: [Methods (Annotation Procedure)] The moderate inter-model agreement on the 41-strategy labels leaves the headline Guilt Induction coefficient (Δ ≈ -23 pp) vulnerable to annotation noise. A sensitivity analysis or human-validated subset for the guilt category is needed to demonstrate that the reported association is not inflated by systematic mislabeling of guilt-adjacent turns.
Authors: We thank the referee for highlighting this important point. While the moderate agreement is a limitation, the replication of the Guilt Induction effect across three independent models provides evidence that the association is not solely due to annotator-specific noise. Nevertheless, to strengthen the claim, we will include a sensitivity analysis in the revised manuscript. Specifically, we will re-estimate the models using only the subset of turns where at least two of the three LLMs agree on the Guilt Induction label, and report whether the effect persists. We will also discuss the implications of inter-model disagreement in the limitations section. revision: yes
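A minimal sketch of the proposed two-of-three agreement filter, assuming per-turn 0/1 Guilt Induction flags from each annotator (file and column names are hypothetical):

```python
# Sketch of the proposed sensitivity analysis: keep a turn's guilt label
# only when at least two of the three LLM annotators agree, then rebuild
# the dialogue-level indicator. File and column names are hypothetical.
import pandas as pd

turns = pd.read_csv("turn_level_labels.csv")  # hypothetical file
guilt_cols = ["guilt_qwen3", "guilt_mistral", "guilt_phi4"]  # 0/1 flags

turns["guilt_majority"] = (turns[guilt_cols].sum(axis=1) >= 2).astype(int)
dialogue_guilt = turns.groupby("dialogue_id")["guilt_majority"].max()
# dialogue_guilt can replace the single-annotator indicator in the
# regression to check whether the -23 pp association persists.
```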
Referee: [Results (Statistical Tests)] The statistical analysis section does not specify the multiple-comparison correction procedure, the exact logistic regression specification (including controls for dialogue length, speaker effects, or turn position), or how pseudo-R² was computed. These omissions prevent assessment of whether the significance claims for Guilt Induction and Reciprocity are robust to plausible confounds.
Authors: We appreciate this feedback on the clarity of our statistical reporting. In the revised manuscript, we will explicitly detail: (1) the multiple-comparison correction procedure used (Bonferroni correction across the 11 strategy categories); (2) the full logistic regression specification, including controls for dialogue length, turn position within the dialogue, and fixed effects for individual persuaders where applicable; and (3) the computation of pseudo-R² (we used McFadden's pseudo-R²). We will also add robustness checks with additional controls to address potential confounds. revision: yes
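For readers unfamiliar with the two quantities named in this response, the standard definitions are as follows (α = 0.05 is an illustrative choice, not a value stated by the authors):

```latex
% McFadden's pseudo-R^2: one minus the ratio of the fitted model's
% log-likelihood to that of an intercept-only (null) model.
R^2_{\mathrm{McF}} \;=\; 1 - \frac{\ln \hat{L}_{\mathrm{model}}}{\ln \hat{L}_{\mathrm{null}}}

% Bonferroni correction across m = 11 strategy categories:
% category i is declared significant only if
p_i \;\le\; \frac{\alpha}{m} \;=\; \frac{0.05}{11} \;\approx\; 0.0045
```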
Circularity Check
No significant circularity: direct empirical analysis of external corpus outcomes against LLM labels
full rationale
The paper's chain consists of (1) applying three external LLMs to label an independently collected corpus (PersuasionForGood, Wang et al. 2019) with a fixed 41-strategy taxonomy and (2) running logistic regression of the resulting binary indicators against the corpus's pre-existing donation outcomes. No equation defines a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no central claim rests on a self-citation whose validity is presupposed by the present work. The reported associations (including the Guilt Induction coefficient) are therefore falsifiable against the raw donation data and the LLM outputs; they do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions underlying pseudo-R² calculation and multiple-comparison-corrected significance tests in regression models.
Reference graph
Works this paper leans on
- [1] Introduction: "Charitable donation conversations are a natural setting for studying persuasion: a persuader attempts to convince a target to donate money, and the outcome (donated or not, and how much) is directly observable. The PersuasionForGood corpus (Wang et al., 2019) provides 1,017 such dialogues collected via Amazon Mechanical Turk, where per…"
- [2] How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues (internal anchor). Related Work: "PersuasionForGood and persuasion in NLP. Wang et al. (2019) introduced the corpus with 10 strategy labels (e.g., logical appeal, emotional appeal, credibility appeal) and an RCNN-based classifier. Saha et al. (2021) improved classification with BERT-based models, while Tian et al. (2020) analyzed target resistance strategies. Chen and Yang…"
- [3] Hierarchical persuasion strategy classification system. Methodology, 3.1 Taxonomy: "Starting from Cialdini's (1984) principles of influence and Marwell and Schmitt's (1967) compliance-gaining strategies, supplemented by work on fear appeals, framing, and emotional manipulation, we compiled 45 candidate strategies. Pilot annotation revealed that several were poorly distinguishable (e.g., overlapping moral and value-…"
- [4] "Kids are dying from hunger every minute. Don't you want to help stop that?" Analysis and Results: "We conduct analyses at two levels of granularity. At the category level (Section 4.1), we test whether the presence of each of the 11 strategy categories in a dialogue is associated with donation outcome. At the individual strategy level (Sections 4.2–4.3), we restrict tests to strategies appearing in at least 20 dialogues (n ≥ 20), a minimum-fr…"
- [5] Discussion and Conclusion: "Our results indicate that persuasion effectiveness cannot be reduced to 'strategy X leads to donation.' First, strategy categories have limited predictive power (pseudo R² = 0.011–0.016 across all three annotators), challenging the assumption that strategy identification alone captures persuasion effectiveness. With 4–5 strategies per…"
- [6] Limitations: "Our LLM annotations are produced without fine-tuning; inter-model agreement on fine-grained labels is moderate (κ = 0.38–0.54), and annotation noise may attenuate downstream estimates. Key findings (Guilt backfire, low category R²) are robust to annotator choice, but weaker effects (e.g., Reciprocity) vary across models. Each turn receives one…"
- [7] Ethics Statement: "This work analyzes existing publicly available dialogue data (Wang et al., 2019). No new human subjects data was collected. The persuasion strategies we study are from cooperative charitable donation contexts. Findings about persuasion strategy effectiveness could theoretically inform manipulative applications; however, the primary intended use…"
- [8] Bibliographical References: "Suhaib Abdurahman, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. 2025. A primer for evaluating large language models in social-science research. Advances in Methods and Practices in Psychological Science, 8(2). Jack W. Brehm. 1966. A Theory of Psychological Reactance. Academic Press. Na…"
- [9] Tian et al. (2020): "Understanding user resistance strategies in persuasive conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4799–4808. Association for Computational Linguistics."
- [10] Wang, Xuewei, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu (2019): "Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649. Association for Computational Linguistics."
- [11] Language Resource References, Appendix A (Full Strategy Taxonomy): "Table 5 lists all 41 persuasion strategies and 9 conversation management labels with turn counts." Excerpted rows:

  Category                    Strategy             n    %
  Norms / Morality / Values   Appeal to Values     543  5.1
                              Moral Appeal         521  4.9
                              Guilt Induction      129  1.2
                              Self-feeling Appeal  138  1.3
  Rational / Impact Appeal    Rational Appeal      928  8.8
                              Logic…