Prosocial Persuasion at Scale? Large Language Models Outperform Humans in Donation Appeals Across Levels of Personalization

Bennett Kleinberg; John Caffier; Olga Stavrova

arxiv: 2604.03202 · v1 · submitted 2026-04-03 · 💻 cs.CY

Prosocial Persuasion at Scale? Large Language Models Outperform Humans in Donation Appeals Across Levels of Personalization

John Caffier , Olga Stavrova , Bennett Kleinberg This is my paper

Pith reviewed 2026-05-13 17:57 UTC · model grok-4.3

classification 💻 cs.CY

keywords large language modelsdonation appealsprosocial persuasionpersonalizationcharitable givingAI-generated contentonline experimentspersuasiveness

0 comments

The pith

LLM-generated donation appeals produced more donations, higher engagement, and stronger persuasiveness ratings than human-written ones in two experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can generate effective appeals for charitable donations and whether they match or exceed human performance when personalization is varied. Across two preregistered studies, participants allocated more of an experimental bonus to charities after reading LLM content, spent more time with the appeals, and rated them as more persuasive. Personalization boosted donations in one experiment while false personalization reduced them in the other. The results indicate that LLMs can be used to produce content that encourages prosocial giving at scale.

Core claim

LLM-authored donation appeals outperform human-authored appeals: participants gave more of their bonus to the charities, showed greater engagement with the material, and judged the appeals more persuasive when the text came from an LLM rather than a human author. This advantage held across generic, personalized, and falsely personalized versions of the appeals.

What carries the argument

Experimental contrast between human-written and LLM-generated donation appeals, crossed with three levels of personalization, measured by actual bonus allocation, reading time, and self-reported persuasiveness.

If this is right

Charities and nonprofits could generate large volumes of donation appeals with LLMs and expect higher response rates than from equivalent human writing.
Accurate personalization increases giving while inaccurate personalization decreases it, so organizations must verify data before using it in appeals.
LLMs shift from being viewed primarily as risks for misinformation to tools that can promote positive social behaviors such as charitable giving.
The same generation approach may apply to other prosocial requests such as volunteering or blood donation appeals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If LLM performance holds in real-world campaigns, organizations might reduce reliance on human copywriters for routine fundraising materials.
Similar experiments could test whether the advantage extends to other domains like environmental appeals or health-behavior messages.
Long-term use of LLM appeals might change donor expectations, making highly polished text the new baseline for what feels authentic.

Load-bearing premise

Allocating a small experimental bonus accurately reflects how people donate their own money in real charitable situations, and that human and LLM texts were produced under comparable effort and quality standards.

What would settle it

A field study in which participants use their own funds to donate after seeing either human or LLM appeals and show equal or lower giving for the LLM versions.

read the original abstract

Large Language Models (LLMs) are increasingly regarded as having the potential to generate persuasive content at scale. While previous studies have focused on the risks associated with LLM-generated misinformation, the role of LLMs in enabling prosocial persuasion is still underexplored. We investigate whether donation appeals authored by LLMs are as effective as those written by humans across degrees of personalization. Two preregistered online experiments (Study 1: N = 658; Study 2: N = 642) manipulated Personalization (generic vs. personalized vs. falsely personalized) and Content source (human vs. LLM) and presented participants with donation appeals for charities. We assessed how participants distributed their bonus money across the charities, how they engaged with the donation appeals, and how persuasive they found them. In both experiments, LLM-generated content yielded more donations, resulted in higher engagement, and was rated as more persuasive than human-authored content. There was a gain associated with personalization (Study 2) and a penalty for false personalization (Study 1). Our results suggest that LLMs may be a suitable technology for generating content that can encourage prosocial behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLM appeals got more donations than human ones, but the human baseline needs tighter documentation on effort and expertise.

read the letter

The main result is that LLM-generated donation appeals produced more actual donations from participants' bonuses, higher engagement, and better persuasiveness ratings than human-written ones in two preregistered experiments with roughly 650 people each. Personalization added a small lift in one study while false personalization created a penalty in the other. This is a clean head-to-head on prosocial content rather than the usual focus on misinformation risks. The behavioral donation measure is a step up from pure self-report scales, and the factorial design keeps the personalization variable clear. Sample sizes are adequate for the claims. The work is straightforward empirical work that extends existing persuasion studies without circularity or heavy modeling assumptions. The soft spot is the human content side. The description of how the human appeals were written is high-level only, with little on writer experience, time per piece, compensation, or editing rounds. If the humans operated under tighter constraints than the LLM prompting process, the gap is harder to attribute cleanly to the model. The online bonus allocation is a reasonable proxy but still not identical to real out-of-pocket giving. Stats and effect sizes are not in the abstract, so the full paper needs to show those details hold up. This is useful for people working on AI for fundraising or health messaging. It is worth sending for peer review so the methods can be tightened and the comparison made more transparent.

Referee Report

1 major / 1 minor

Summary. The paper reports two preregistered online experiments (Study 1: N=658; Study 2: N=642) that manipulate content source (human vs. LLM) and personalization level (generic vs. personalized vs. falsely personalized) in donation appeals for charities. Participants allocated a bonus across charities while measures of actual donations, engagement, and perceived persuasiveness were collected. The central finding is that LLM-generated appeals produced higher donations, greater engagement, and higher persuasiveness ratings than human-authored appeals in both studies, with a personalization benefit in Study 2 and a false-personalization penalty in Study 1.

Significance. If the results hold after clarifying production constraints, the work supplies preregistered evidence that LLMs can generate prosocial persuasive content at scale more effectively than humans under the tested conditions. This has direct implications for charitable fundraising and scalable prosocial messaging. The preregistration and sample sizes are strengths that support the empirical contribution, though external validity hinges on whether the human-LLM comparison isolates model capability rather than differences in authoring effort.

major comments (1)

[Methods] Methods section on content generation: the procedures for producing human-authored appeals provide only high-level information and omit details on writer expertise, time allocated per appeal, compensation, and whether iterative editing was allowed. In contrast, LLM prompting is described in detail. Because the headline claim treats the comparison as direct evidence of LLM superiority, this missing information on effort parity is load-bearing and must be addressed to support attribution of the donation and persuasiveness differences to the content source itself.

minor comments (1)

[Abstract] Abstract: no statistical details, effect sizes, exclusion criteria, or checks for confounds such as prompt quality or content length are supplied, which weakens the immediate readability of the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the methodological details necessary to support our claims. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Methods] Methods section on content generation: the procedures for producing human-authored appeals provide only high-level information and omit details on writer expertise, time allocated per appeal, compensation, and whether iterative editing was allowed. In contrast, LLM prompting is described in detail. Because the headline claim treats the comparison as direct evidence of LLM superiority, this missing information on effort parity is load-bearing and must be addressed to support attribution of the donation and persuasiveness differences to the content source itself.

Authors: We agree that the current Methods section provides only high-level information on human-authored appeal generation and that greater transparency is needed to evaluate effort parity. In the revised manuscript, we will expand this section to specify that the human writers were experienced freelance content creators recruited via a standard platform, allocated up to 30 minutes per appeal, compensated at prevailing market rates for similar tasks, and instructed to produce a single draft without iterative editing or external feedback. This protocol was designed to approximate the one-shot generation used for the LLM condition. We will also add a brief discussion of how these constraints align with typical real-world production differences between human and automated content. These additions will allow readers to better assess whether the observed differences are attributable to content source rather than unequal authoring effort. revision: yes

Circularity Check

0 steps flagged

No circularity: straightforward empirical comparison with no derivations or self-referential reductions

full rationale

The paper reports results from two preregistered experiments (N=658 and N=642) that directly measure donation amounts, engagement metrics, and persuasiveness ratings for LLM-generated versus human-authored appeals under varying personalization conditions. No equations, fitted parameters, predictive models, or derivation chains appear in the abstract or described methods. Claims rest on experimental data collection rather than any self-definition, fitted-input prediction, or load-bearing self-citation of uniqueness theorems. The comparison is presented as an empirical test without reduction to prior author work or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard experimental psychology methods and statistical assumptions; no free parameters, new axioms, or invented entities are introduced.

axioms (1)

standard math Standard assumptions of experimental design and statistical inference in behavioral science (e.g., random assignment, normality for t-tests/ANOVA)
Implicit in the reporting of two preregistered experiments and their outcomes.

pith-pipeline@v0.9.0 · 5510 in / 1152 out tokens · 56072 ms · 2026-05-13T17:57:12.351188+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Voelkel, Shane Muldowney, Johannes C

https://doi.org/10.1038/s41467-025-61345-5 Bajpai, S., Sameer, A., & Fatima, R. (2025). Insights into Moral Reasoning of AI: A Comparative Study Between Humans and Large Language Models. Journal of Media Ethics, 1–15. Barclay, P., & Barker, J. L. (2020). Greener than thou: People who protect the environment are more cooperative, compete to be environmenta...

work page doi:10.1038/s41467-025-61345-5 2025
[2]

https://doi.org/10.1177/0093650220961965 Zettler, I., & Strandsbjerg, C. F. (2025). Personalized interventions. Current Opinion in Psychology, 66, 102147. https://doi.org/10.1016/j.copsyc.2025.102147 Zhu, Q., Chong, L., Yang, M., & Luo, J. (2024). Reading users’ minds from what they say: An investigation into llm-based empathic mental inference. Internati...

work page doi:10.1177/0093650220961965 2025

[1] [1]

Voelkel, Shane Muldowney, Johannes C

https://doi.org/10.1038/s41467-025-61345-5 Bajpai, S., Sameer, A., & Fatima, R. (2025). Insights into Moral Reasoning of AI: A Comparative Study Between Humans and Large Language Models. Journal of Media Ethics, 1–15. Barclay, P., & Barker, J. L. (2020). Greener than thou: People who protect the environment are more cooperative, compete to be environmenta...

work page doi:10.1038/s41467-025-61345-5 2025

[2] [2]

https://doi.org/10.1177/0093650220961965 Zettler, I., & Strandsbjerg, C. F. (2025). Personalized interventions. Current Opinion in Psychology, 66, 102147. https://doi.org/10.1016/j.copsyc.2025.102147 Zhu, Q., Chong, L., Yang, M., & Luo, J. (2024). Reading users’ minds from what they say: An investigation into llm-based empathic mental inference. Internati...

work page doi:10.1177/0093650220961965 2025