How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues
Pith reviewed 2026-05-14 22:22 UTC · model grok-4.3
The pith
Guilt induction is the only persuasion strategy significantly linked to lower donation rates in these dialogues, associated with roughly a 23-percentage-point drop in compliance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Labeling all 10,600 persuader turns in the PersuasionForGood corpus with a 41-item strategy taxonomy via three LLMs reveals that strategy categories alone account for little variance in donation outcome. Guilt Induction is the sole strategy significantly tied to lower donation rates, with an effect size of roughly 23 percentage points that replicates across annotators despite only moderate agreement between them. Reciprocity emerges as the most consistent positive correlate, while target sentiment predicts donation occurrence but shows only weak links to donation amount.
What carries the argument
LLM annotation of 41 persuasion strategies across all dialogue turns, followed by logistic regressions testing associations between strategy presence and observed donation outcomes, corrected for multiple comparisons.
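For orientation, the shape of such a test is easy to sketch. The following is a minimal illustration, not the authors' pipeline: the file name, the column layout (one row per dialogue, 0/1 category indicators, a 0/1 donated flag), and the choice of α are all assumptions.

```python
# Illustrative sketch: logistic regression of donation outcome on
# binary strategy-category indicators, with Bonferroni correction.
# File name and column layout are hypothetical, not the paper's code.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("dialogues_with_category_flags.csv")  # hypothetical file
categories = [c for c in df.columns if c.startswith("cat_")]  # 11 indicators

X = sm.add_constant(df[categories].astype(float))
model = sm.Logit(df["donated"].astype(float), X).fit(disp=0)

# statsmodels reports McFadden's pseudo-R^2 directly.
print("pseudo-R^2 (McFadden):", round(model.prsquared, 4))

# Bonferroni: a category counts as significant only if p <= alpha / m.
alpha, m = 0.05, len(categories)
for cat in categories:
    p = model.pvalues[cat]
    verdict = "significant" if p <= alpha / m else "n.s."
    print(f"{cat}: coef={model.params[cat]:+.3f}, p={p:.4f} ({verdict})")
```

A pseudo-R² near 0.015 from a fit of this shape is what the paper summarizes as strategy categories explaining little variance.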
If this is right
- Most persuasion strategies have negligible measurable impact on whether people donate.
- Guilt-based appeals are associated with fewer donations and may be counterproductive.
- Reciprocity-based approaches show the strongest positive link to donations.
- Recipient sentiment and interest are better predictors of donation occurrence than any strategy label.
- Identifying strategies alone cannot account for most variation in persuasion effectiveness.
Where Pith is reading between the lines
- Persuaders in charity contexts may benefit from avoiding guilt tactics entirely.
- Dialogue features such as timing, sequence, or emotional tone could matter more than discrete strategy labels.
- The same annotation approach could be applied to other prosocial behaviors like volunteering or policy support.
Load-bearing premise
The strategy labels produced by the LLMs are accurate and unbiased enough to support causal-sounding conclusions about real donation behavior.
What would settle it
A human re-annotation of a random subset of the dialogues, testing whether the same significant associations survive, particularly the one for guilt induction.
read the original abstract
Which persuasion strategies, if any, are associated with donation compliance? Answering this requires fine-grained strategy labels across a full corpus and statistical tests corrected for multiple comparisons. We annotate all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus (Wang et al., 2019), where donation outcomes are directly observable, with a taxonomy of 41 strategies in 11 categories, using three open-source large language models (LLMs; Qwen3:30b, Mistral-Small-3.2, Phi-4). Strategy categories alone explain little variance in donation outcome (pseudo $R^2 \approx 0.015$, consistent across all three annotators). Guilt Induction is the only strategy significantly associated with lower donation rates ($\Delta \approx -23$ percentage points), an effect that replicates across all three models despite only moderate inter-model agreement. Reciprocity is the most robust positive correlate. Target sentiment and interest predict whether a donation occurs but show at most a weak correlation with donation amount. These findings suggest that strategy identification alone is insufficient to explain persuasion effectiveness, and that guilt-based appeals may be counterproductive in prosocial settings. We release the fully annotated corpus as a public resource.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript annotates all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus using three open-source LLMs (Qwen3:30b, Mistral-Small-3.2, Phi-4) to label a taxonomy of 41 strategies in 11 categories. Logistic regressions on the resulting binary indicators show that strategy categories explain little variance in donation outcome (pseudo-R² ≈ 0.015 across annotators). Guilt Induction is the only strategy significantly associated with lower donation rates (Δ ≈ -23 percentage points), replicating across models despite moderate inter-model agreement; Reciprocity is the most robust positive correlate. Target sentiment and interest predict donation occurrence but correlate weakly with amount. The authors conclude that strategy identification alone is insufficient to explain persuasion effectiveness and release the annotated corpus publicly.
Significance. If the LLM labels are shown to be sufficiently reliable, the study supplies large-scale empirical evidence on which persuasion strategies are associated with observable donation compliance in real dialogues. The low pseudo-R² and the replicated negative association for Guilt Induction are informative for computational persuasion research, while the public release of the fully labeled corpus constitutes a concrete resource for follow-up work.
major comments (2)
- [Methods (Annotation Procedure)] The moderate inter-model agreement on the 41-strategy labels leaves the headline Guilt Induction coefficient (Δ ≈ -23 pp) vulnerable to annotation noise. A sensitivity analysis or human-validated subset for the guilt category is needed to demonstrate that the reported association is not inflated by systematic mislabeling of guilt-adjacent turns.
- [Results (Statistical Tests)] The statistical analysis section does not specify the multiple-comparison correction procedure, the exact logistic regression specification (including controls for dialogue length, speaker effects, or turn position), or how pseudo-R² was computed. These omissions prevent assessment of whether the significance claims for Guilt Induction and Reciprocity are robust to plausible confounds.
minor comments (1)
- [Abstract] The abstract states that the effect 'replicates across all three models' but does not report the inter-model agreement metric (e.g., Fleiss' kappa or pairwise F1); adding this value would clarify the strength of the replication claim.
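For reference, the pairwise agreement figure the referee asks for is cheap to compute; below is a minimal sketch using pairwise Cohen's κ over three aligned per-turn label vectors (the toy labels and annotator keys are stand-ins, not the paper's data).

```python
# Minimal sketch: pairwise Cohen's kappa between three annotators'
# per-turn strategy labels. The four-turn vectors are illustrative.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

labels = {
    "qwen3":   ["guilt", "reciprocity", "rational", "guilt"],
    "mistral": ["guilt", "rational",    "rational", "guilt"],
    "phi4":    ["moral", "reciprocity", "rational", "guilt"],
}

for a, b in combinations(labels, 2):
    print(f"kappa({a}, {b}) = {cohen_kappa_score(labels[a], labels[b]):.2f}")
```

Fleiss' κ generalizes this to all three annotators at once; the paper's Limitations section quotes κ = 0.38–0.54, squarely in the conventional "moderate" band.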
Simulated Author's Rebuttal
We are grateful to the referee for their thoughtful review and constructive suggestions. We believe the revisions outlined below will strengthen the manuscript and address the concerns raised. We respond to each major comment in turn.
read point-by-point responses
Referee: [Methods (Annotation Procedure)] The moderate inter-model agreement on the 41-strategy labels leaves the headline Guilt Induction coefficient (Δ ≈ -23 pp) vulnerable to annotation noise. A sensitivity analysis or human-validated subset for the guilt category is needed to demonstrate that the reported association is not inflated by systematic mislabeling of guilt-adjacent turns.
Authors: We thank the referee for highlighting this important point. While the moderate agreement is a limitation, the replication of the Guilt Induction effect across three independent models provides evidence that the association is not solely due to annotator-specific noise. Nevertheless, to strengthen the claim, we will include a sensitivity analysis in the revised manuscript. Specifically, we will re-estimate the models using only the subset of turns where at least two of the three LLMs agree on the Guilt Induction label, and report whether the effect persists. We will also discuss the implications of inter-model disagreement in the limitations section. revision: yes
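A minimal sketch of the proposed two-of-three agreement filter, assuming per-turn 0/1 Guilt Induction flags from each annotator (file and column names are hypothetical):

```python
# Sketch of the proposed sensitivity analysis: keep a turn's guilt label
# only when at least two of the three LLM annotators agree, then rebuild
# the dialogue-level indicator. File and column names are hypothetical.
import pandas as pd

turns = pd.read_csv("turn_level_labels.csv")  # hypothetical file
guilt_cols = ["guilt_qwen3", "guilt_mistral", "guilt_phi4"]  # 0/1 flags

turns["guilt_majority"] = (turns[guilt_cols].sum(axis=1) >= 2).astype(int)
dialogue_guilt = turns.groupby("dialogue_id")["guilt_majority"].max()
# dialogue_guilt can replace the single-annotator indicator in the
# regression to check whether the -23 pp association persists.
```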
Referee: [Results (Statistical Tests)] The statistical analysis section does not specify the multiple-comparison correction procedure, the exact logistic regression specification (including controls for dialogue length, speaker effects, or turn position), or how pseudo-R² was computed. These omissions prevent assessment of whether the significance claims for Guilt Induction and Reciprocity are robust to plausible confounds.
Authors: We appreciate this feedback on the clarity of our statistical reporting. In the revised manuscript, we will explicitly detail: (1) the multiple-comparison correction procedure used (Bonferroni correction across the 11 strategy categories); (2) the full logistic regression specification, including controls for dialogue length, turn position within the dialogue, and fixed effects for individual persuaders where applicable; and (3) the computation of pseudo-R² (we used McFadden's pseudo-R²). We will also add robustness checks with additional controls to address potential confounds. revision: yes
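For readers unfamiliar with the two quantities named in this response, the standard definitions are as follows (α = 0.05 is an illustrative choice, not a value stated by the authors):

```latex
% McFadden's pseudo-R^2: one minus the ratio of the fitted model's
% log-likelihood to that of an intercept-only (null) model.
R^2_{\mathrm{McF}} \;=\; 1 - \frac{\ln \hat{L}_{\mathrm{model}}}{\ln \hat{L}_{\mathrm{null}}}

% Bonferroni correction across m = 11 strategy categories:
% category i is declared significant only if
p_i \;\le\; \frac{\alpha}{m} \;=\; \frac{0.05}{11} \;\approx\; 0.0045
```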
Circularity Check
No significant circularity: direct empirical analysis of external corpus outcomes against LLM labels
full rationale
The paper's chain consists of (1) applying three external LLMs to label an independently collected corpus (PersuasionForGood, Wang et al. 2019) with a fixed 41-strategy taxonomy and (2) running logistic regression of the resulting binary indicators against the corpus's pre-existing donation outcomes. No equation defines a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no central claim rests on a self-citation whose validity is presupposed by the present work. The reported associations (including the Guilt Induction coefficient) are therefore falsifiable against the raw donation data and the LLM outputs; they do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions underlying pseudo-R² calculation and multiple-comparison-corrected significance tests in regression models.
Reference graph
Works this paper leans on
- [1] Introduction: "Charitable donation conversations are a natural setting for studying persuasion: a persuader attempts to convince a target to donate money, and the outcome (donated or not, and how much) is directly observable. The PersuasionForGood corpus (Wang et al., 2019) provides 1,017 such dialogues collected via Amazon Mechanical Turk, where per…"
- [2] How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues (internal anchor). Related Work: "PersuasionForGood and persuasion in NLP. Wang et al. (2019) introduced the corpus with 10 strategy labels (e.g., logical appeal, emotional appeal, credibility appeal) and an RCNN-based classifier. Saha et al. (2021) improved classification with BERT-based models, while Tian et al. (2020) analyzed target resistance strategies. Chen and Yang…"
- [3] Hierarchical persuasion strategy classification system. Methodology, 3.1 Taxonomy: "Starting from Cialdini's (1984) principles of influence and Marwell and Schmitt's (1967) compliance-gaining strategies, supplemented by work on fear appeals, framing, and emotional manipulation, we compiled 45 candidate strategies. Pilot annotation revealed that several were poorly distinguishable (e.g., overlapping moral and value-…"
- [4] "Kids are dying from hunger every minute. Don't you want to help stop that?" Analysis and Results: "We conduct analyses at two levels of granularity. At the category level (Section 4.1), we test whether the presence of each of the 11 strategy categories in a dialogue is associated with donation outcome. At the individual strategy level (Sections 4.2–4.3), we restrict tests to strategies appearing in at least 20 dialogues (n ≥ 20), a minimum-fr…"
- [5] Discussion and Conclusion: "Our results indicate that persuasion effectiveness cannot be reduced to 'strategy X leads to donation.' First, strategy categories have limited predictive power (pseudo R² = 0.011–0.016 across all three annotators), challenging the assumption that strategy identification alone captures persuasion effectiveness. With 4–5 strategies per…"
- [6] Limitations: "Our LLM annotations are produced without fine-tuning; inter-model agreement on fine-grained labels is moderate (κ = 0.38–0.54), and annotation noise may attenuate downstream estimates. Key findings (Guilt backfire, low category R²) are robust to annotator choice, but weaker effects (e.g., Reciprocity) vary across models. Each turn receives one…"
- [7] Ethics Statement: "This work analyzes existing publicly available dialogue data (Wang et al., 2019). No new human subjects data was collected. The persuasion strategies we study are from cooperative charitable donation contexts. Findings about persuasion strategy effectiveness could theoretically inform manipulative applications; however, the primary intended use…"
- [8] Bibliographical References: "Suhaib Abdurahman, Alireza Salkhordeh Ziabari, Alexander K. Moore, Daniel M. Bartels, and Morteza Dehghani. 2025. A primer for evaluating large language models in social-science research. Advances in Methods and Practices in Psychological Science, 8(2). Jack W. Brehm. 1966. A Theory of Psychological Reactance. Academic Press. Na…"
- [9] Tian et al. (2020): "Understanding user resistance strategies in persuasive conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4799–4808. Association for Computational Linguistics."
- [10] Wang, Xuewei, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu (2019): "Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649. Association for Computational Linguistics."
- [11] Language Resource References, Appendix A (Full Strategy Taxonomy): "Table 5 lists all 41 persuasion strategies and 9 conversation management labels with turn counts." Excerpted rows:

  Category                    Strategy             n    %
  Norms / Morality / Values   Appeal to Values     543  5.1
                              Moral Appeal         521  4.9
                              Guilt Induction      129  1.2
                              Self-feeling Appeal  138  1.3
  Rational / Impact Appeal    Rational Appeal      928  8.8
                              Logic…