
arxiv: 2605.23867 · v1 · pith:3D6QP7VQnew · submitted 2026-05-22 · 💻 cs.HC · cs.AI

Human Decision-Making with Persuasive and Narrative LLM Explanations

Pith reviewed 2026-05-25 03:04 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords LLM explanations · human decision-making · persuasiveness · narrative explanations · AI reliance · decision accuracy · explainable AI · behavioral experiment

The pith

LLM narrative explanations of varying persuasiveness do not improve human decision accuracy beyond a simple AI prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether narrative explanations from large language models, crafted at different levels of persuasiveness, help people reach more accurate decisions in classification tasks. A large behavioral experiment compared these explanations against simply showing the AI's prediction without any text. Persuasiveness made no meaningful difference to accuracy, yet the narratives raised people's tendency to follow the AI both when it was right and when it was wrong. Exploratory checks suggested that stronger narratives could slow responses and make it harder to spot when the AI was mistaken. The work concludes that narrative explanations carry performance tradeoffs rather than automatic gains.

Core claim

In a large-scale human behavioral experiment evaluating decision-making performance with LLM-generated narrative explanations of varying persuasiveness, the degree of persuasiveness did not meaningfully impact decision accuracy over a simple AI prediction alone. Narratives increased reliance on AI both when predictions were correct and incorrect. More persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction.

What carries the argument

Experimental manipulation of persuasiveness in LLM-generated narrative explanations and its measured effects on human accuracy, reliance, response time, and discrimination in classification tasks.
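
The four measured quantities can be made concrete with a small sketch. It assumes a binary task (so rejecting a wrong AI prediction yields a correct answer) and uses a signal-detection d' as the discrimination index; both are assumptions of this illustration, and the paper's own measures may be defined differently. The function names and all trial counts are hypothetical.

```python
from statistics import NormalDist


def summarize(trials):
    """Summarize one condition.

    trials: list of (ai_correct, followed_ai) boolean pairs, one per trial.
    Assumes a binary task, so rejecting a wrong AI prediction yields a
    correct answer (an assumption of this sketch, not a detail from the paper).
    """
    n_correct = sum(1 for ai_ok, _ in trials if ai_ok)
    n_incorrect = len(trials) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("need trials with both correct and incorrect AI predictions")

    followed_correct = sum(1 for ai_ok, followed in trials if ai_ok and followed)
    followed_incorrect = sum(1 for ai_ok, followed in trials if not ai_ok and followed)

    # Log-linear correction keeps the z-transform finite at rates of 0 or 1.
    p_hit = (followed_correct + 0.5) / (n_correct + 1)
    p_false_alarm = (followed_incorrect + 0.5) / (n_incorrect + 1)
    z = NormalDist().inv_cdf

    return {
        "reliance_ai_correct": followed_correct / n_correct,
        "reliance_ai_incorrect": followed_incorrect / n_incorrect,
        "accuracy": (followed_correct + n_incorrect - followed_incorrect) / len(trials),
        "d_prime": z(p_hit) - z(p_false_alarm),
    }


def make_trials(n_each, followed_when_correct, followed_when_incorrect):
    """Build hypothetical trials: n_each with a correct AI, n_each with an incorrect AI."""
    trials = [(True, i < followed_when_correct) for i in range(n_each)]
    trials += [(False, i < followed_when_incorrect) for i in range(n_each)]
    return trials


# Made-up numbers for illustration only; they are not results from the paper.
prediction_only = summarize(make_trials(10, 7, 3))
with_narrative = summarize(make_trials(10, 9, 5))
print(prediction_only)
print(with_narrative)
```

In this made-up data, reliance rises both when the AI is right and when it is wrong, while accuracy stays the same: the extra correct answers gained by following a correct AI are cancelled by extra errors from following an incorrect one.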

If this is right

  • Narrative explanations increase human reliance on AI predictions even when those predictions are incorrect.
  • More persuasive narratives can lengthen decision response times.
  • Persuasive narratives can reduce people's ability to distinguish correct from incorrect AI predictions.
  • Including narrative explanations with AI predictions involves tradeoffs for objective decision-making performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI system designers may need to test whether adding any explanation is worth the risk of increased over-reliance in a given domain.
  • High-stakes settings such as medical diagnosis could benefit from withholding narrative text and showing only the raw prediction.
  • Future experiments could measure whether calibration techniques, such as showing confidence scores alongside narratives, restore discrimination ability.

Load-bearing premise

The experimental manipulation successfully varied the persuasiveness of the LLM explanations independently of other factors such as explanation length or content accuracy.
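
One way to probe this premise is to check whether explanation length differs across conditions. The sketch below computes mean word counts per condition and a standardized mean difference for each pair of conditions. The condition labels and example texts are hypothetical, and whitespace tokenization is a simplifying assumption; the paper's own text analyses may use different tooling.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, variance


def word_count(text):
    """Count whitespace-separated tokens (a simplifying assumption)."""
    return len(text.split())


def length_balance(explanations_by_condition):
    """Return mean word counts and pairwise standardized mean differences.

    explanations_by_condition: dict mapping a condition label to a list of
    at least two explanation strings.
    """
    counts = {cond: [word_count(t) for t in texts]
              for cond, texts in explanations_by_condition.items()}
    means = {cond: mean(c) for cond, c in counts.items()}

    smd = {}
    for a, b in combinations(counts, 2):
        xa, xb = counts[a], counts[b]
        pooled_var = ((len(xa) - 1) * variance(xa)
                      + (len(xb) - 1) * variance(xb)) / (len(xa) + len(xb) - 2)
        diff = means[a] - means[b]
        if pooled_var > 0:
            smd[(a, b)] = diff / sqrt(pooled_var)
        else:
            # Zero spread in both groups: report 0 if equal, else signed infinity.
            smd[(a, b)] = 0.0 if diff == 0 else copysign(inf, diff)
    return means, smd


# Hypothetical condition labels and texts, for illustration only.
example = {
    "neutral": ["The model predicts a higher income.",
                "Education and hours worked point to a lower income."],
    "persuasive": ["Given the strong education record, the model confidently predicts a higher income.",
                   "Every key indicator here clearly and consistently points to a lower income."],
}
means, smd = length_balance(example)
print(means)
print(smd)
```

A large standardized difference in length between conditions would suggest that length is confounded with persuasiveness and should be controlled for or reported alongside the main results.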

What would settle it

A direct replication in which participants achieve reliably higher decision accuracy with high-persuasiveness narratives than with low-persuasiveness ones or with AI predictions alone would falsify the central result.

Figures

Figures reproduced from arXiv: 2605.23867 by Jonathan Z. Bakdash, Laura R. Marusich, Mary Grace Kozuch Dhooghe, Murat Kantarcioglu.

Figure 1: Left: Average participant accuracy across the two dataset conditions and four explanation …
Figure 2: Left: Average participant reliance rate on trials where the AI prediction was accurate …
Figure 3: Distribution and mean (dark blue line) word counts for each of the three explanation types …
Figure 4: We measure persuasive speech using two pre-trained transformers.
Figure 5: Emotions measured using NRCLex Raw Emotion Scores (Mohammad and Turney, 2013).
Figure 6: Sentiment is measured as percent negative, positive, or neutral using NLTK Sentiment Intensity Analyzer (Bird et al., 2009; Hutto and Gilbert, 2014). Subjectivity and Polarity are measured using TextBlob Sentiment (Loria, 2025). Error bars are 95% confidence intervals
Figure 7: Readability measured using the TextStat Python library (Shivam Bansal, 2025).
Figure 8: Average word type count collected using TextBlob Tags (Loria, 2025).
Figure 9: Average adjective type count collected using TextBlob Tags (Loria, 2025).
Figure 10: Average adverb type count collected using TextBlob Tags (Loria, 2025).
Figure 11: Example trial in the Neutral Explanation condition using the Census dataset.
Figure 12: Sentiment analysis scores obtained from NLTK Bird et al. (2009); Hutto and Gilbert …
Figure 13: We measure persuasive speech using two pre-trained transformers.

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports results from a large-scale human behavioral experiment on the effects of LLM-generated narrative explanations with varying persuasiveness on human decision accuracy and AI reliance in classification tasks. The main claims are that persuasiveness levels did not meaningfully affect decision accuracy compared to AI predictions alone, that narratives increased reliance on AI predictions whether correct or incorrect, and that more persuasive narratives may have negatively affected response times and discrimination between correct and incorrect AI predictions.

Significance. If these results hold, the work indicates potential tradeoffs in using narrative LLM explanations for decision support, as they may increase over-reliance without improving accuracy. This aligns with prior findings on explainable AI and provides empirical data on narrative forms specifically. The large-scale nature of the experiment strengthens the evidence base for understanding human-AI interaction in decision-making.

major comments (2)
  1. [Methods] Methods section: The description of the persuasiveness manipulation (including any pre-tests, validation of independence from length/accuracy, and how persuasiveness was quantified) is load-bearing for interpreting the null result on decision accuracy; without explicit evidence that the manipulation succeeded independently, the claim that persuasiveness does not impact accuracy cannot be fully evaluated.
  2. [Results] Results section: The central null finding on accuracy requires reporting of effect sizes, confidence intervals, and a power analysis or equivalence test; absence of these weakens the conclusion that persuasiveness 'did not meaningfully impact' performance.
minor comments (1)
  1. [Abstract] Abstract: Specify the participant count, number of trials, and key statistical methods to allow readers to assess the claims without needing the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the interpretation of our null results. We address each major comment below.

  1. Referee: [Methods] Methods section: The description of the persuasiveness manipulation (including any pre-tests, validation of independence from length/accuracy, and how persuasiveness was quantified) is load-bearing for interpreting the null result on decision accuracy; without explicit evidence that the manipulation succeeded independently, the claim that persuasiveness does not impact accuracy cannot be fully evaluated.

    Authors: We agree that the Methods section requires expanded detail on the persuasiveness manipulation to support interpretation of the accuracy null result. The manuscript describes generation of narrative explanations at different persuasiveness levels, but we will revise to include any pre-tests performed, explicit checks confirming independence from explanation length and prediction accuracy, and the precise quantification approach (e.g., rating scales or validation metrics). This will provide the necessary evidence that the manipulation operated as intended. revision: yes

  2. Referee: [Results] Results section: The central null finding on accuracy requires reporting of effect sizes, confidence intervals, and a power analysis or equivalence test; absence of these weakens the conclusion that persuasiveness 'did not meaningfully impact' performance.

    Authors: We concur that null findings on accuracy benefit from effect sizes, confidence intervals, and equivalence testing or power analysis to substantiate claims of no meaningful impact. We will update the Results section to report appropriate effect sizes (e.g., Cohen's d) with 95% confidence intervals for accuracy comparisons across conditions, along with an equivalence test or post-hoc power analysis to strengthen the conclusion that persuasiveness levels did not meaningfully affect decision accuracy. revision: yes
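
The statistics named in this exchange can be sketched as follows: a Cohen's d with an approximate 95% confidence interval, and a two one-sided tests (TOST) equivalence check. This is a minimal illustration under stated assumptions, not the authors' analysis. It uses large-sample normal approximations in place of t distributions, treats per-participant accuracy as the unit of analysis, and the equivalence margin of 0.02 and all data are hypothetical. A study with repeated trials per participant might call for a multilevel model instead.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(x, y, level=0.95):
    """Cohen's d (pooled SD) with a large-sample normal-approximation CI."""
    nx, ny = len(x), len(y)
    pooled_sd = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    d = (mean(x) - mean(y)) / pooled_sd
    # Common large-sample approximation to the standard error of d.
    se = sqrt((nx + ny) / (nx * ny) + d * d / (2 * (nx + ny)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, (d - z * se, d + z * se)


def tost(x, y, margin, alpha=0.05):
    """Two one-sided z-tests for equivalence of means within +/- margin."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    cdf = NormalDist().cdf
    p_lower = 1 - cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = cdf((diff - margin) / se)      # H0: diff >= +margin
    p = max(p_lower, p_upper)
    return diff, p, p < alpha


# Hypothetical per-participant accuracies for two conditions.
prediction_only = [0.70 + 0.01 * ((i % 11) - 5) for i in range(200)]
persuasive = [0.70 + 0.01 * (((i + 3) % 11) - 5) for i in range(200)]

d, ci = cohens_d(persuasive, prediction_only)
diff, p, equivalent = tost(persuasive, prediction_only, margin=0.02)
print(f"d = {d:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"mean difference = {diff:.4f}, TOST p = {p:.4g}, equivalent = {equivalent}")
```

A confidence interval that includes zero shows only that an effect was not detected. The TOST result is what supports a claim of "no meaningful impact", and only relative to a margin chosen and justified in advance.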

Circularity Check

0 steps flagged

Empirical study with no derivation chain


This is an empirical behavioral study reporting outcomes from a human-subjects experiment on decision accuracy and AI reliance. It contains no mathematical derivations, equations, fitted parameters, or theoretical chains that could reduce any result to prior inputs by construction. All claims rest on direct experimental data collection and statistical reporting, which are self-contained with respect to external benchmarks and do not rely on load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of the human behavioral experiment and the assumption that persuasiveness of narratives can be independently manipulated and measured.

axioms (1)
  • domain assumption: Standard statistical assumptions for analyzing human decision data (e.g., independence of trials, appropriate error models) hold.
    Required to interpret reported effects on accuracy and reliance from the experiment.

pith-pipeline@v0.9.0 · 5761 in / 1205 out tokens · 25770 ms · 2026-05-25T03:04:29.235791+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1]

    , booktitle=

    Akgul, Omer and Roberts, Richard and Namara, Moses and Levin, Dave and Mazurek, Michelle L. , booktitle=. Investigating Influencer. 2022 , volume=

  2. [2]

    Lumpkin , title =

    Douglas Alan Amyx and James R. Lumpkin , title =. Journal of Promotion Management , volume =. 2016 , publisher =. doi:10.1080/10496491.2016.1154920 , URL =

  3. [3]

    International Journal of Human--Computer Interaction , volume=

    Explainable artificial intelligence improves human decision-making: results from a mushroom picking experiment at a public art festival , author=. International Journal of Human--Computer Interaction , volume=. 2024 , doi =

  4. [4]

    A meta-analysis of the utility of explainable artificial intelligence in human-

    Schemmer, Max and Hemmer, Patrick and Nitsche, Maximilian and K. A meta-analysis of the utility of explainable artificial intelligence in human-. Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society , pages=

  5. [5]

    Automatica , volume=

    Ironies of automation , author=. Automatica , volume=. 1983 , publisher=

  6. [6]

    Annual Review Economics , volume=

    Persuasion: empirical evidence , author=. Annual Review Economics , volume=. 2010 , doi=

  7. [7]

    2015 , publisher=

    Persuasion: Theory and research, 3rd edition , author=. 2015 , publisher=

  8. [8]

    Scientific Reports , year=

    A meta-analysis of the persuasive power of large language models , author=. Scientific Reports , year=

  9. [9]

    2019 , isbn=

    Influence, New and Expanded The Psychology of Persuasion , author=. 2019 , isbn=

  10. [10]

    Measuring risk literacy: The

    Cokely, Edward T and Galesic, Mirta and Schulz, Eric and Ghazal, Saima and Garcia-Retamero, Rocio , journal=. Measuring risk literacy: The. 2012 , publisher=

  11. [11]

    2015 , publisher=

    De Leeuw, Joshua R , journal=. 2015 , publisher=

  12. [12]

    and Seiter, John , editor=

    Gass, R. and Seiter, John , editor=. Embracing Divergence: A Definitional Analysis of Pure and Borderline Cases of Persuasion , booktitle=. 2004 , month=jan, pages=

  13. [13]

    and Cacioppo, John T

    Petty, Richard E. and Cacioppo, John T. , editor=. Advances in Experimental Social Psychology , title=. 1986 , pages=. doi:https://doi.org/10.1016/S0065-2601(08)60214-2 , publisher=

  14. [14]

    and Thompson, Erik P

    Kruglanski, Arie W. and Thompson, Erik P. , year=. Persuasion by a Single Route: A View From the Unimodel , volume=. Psychological Inquiry , publisher=. doi:10.1207/S15327965PL100201 , number=

  15. [15]

    The heuristic model of persuasion , ISBN=

    Chaiken, Shelly , year=. The heuristic model of persuasion , ISBN=. Social influence: The Ontario symposium, Vol. 5. , publisher=

  16. [16]

    and Seiter, John S

    Gass, Robert H. and Seiter, John S. , year=. Persuasion: Social Influence and Compliance Gaining , ISBN=. doi:10.4324/9781003081388 , publisher=

  17. [17]

    The effects of message features: Content, structure, and style , ISBN=

    Shen, Lijiang and Bigsby, Elisabeth , year=. The effects of message features: Content, structure, and style , ISBN=. The

  18. [18]

    Lange, Kristian and Kühn, Simone and Filevich, Elisa , year=. ". PLOS ONE , publisher=. doi:10.1371/journal.pone.0130834 , number=

  19. [19]

    2006 , publisher=

    Data analysis using regression and multilevel/hierarchical models , author=. 2006 , publisher=

  20. [20]

    and Bakdash, Jonathan Z

    Marusich, Laura R. and Bakdash, Jonathan Z. and Zhou, Yan and Kantarcioglu, Murat , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  21. [21]

    Proceedings of the AAAI Conference on Artificial Intelligence , author=

    Does Explainable Artificial Intelligence Improve Human Decision-Making? , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2021 , month=. doi:10.1609/aaai.v35i8.16819 , number=

  22. [22]

    UCI Machine Learning Repository

    Dua, Dheeru and Graff, Casey. UCI Machine Learning Repository. 2017

  23. [23]

    Tell me a story!

    David Martens and James Hinns and Camille Dams and Mark Vergouwen and Theodoros Evgeniou , keywords =. Tell me a story!. Decision Support Systems , volume =. 2025 , issn =. doi:https://doi.org/10.1016/j.dss.2025.114402 , url =

  24. [24]

    arXiv preprint arXiv:2404.09329 , year=

    Large language models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments , author=. arXiv preprint arXiv:2404.09329 , year=

  25. [25]

    Persuasion with large language models: a survey , author=

  26. [26]

    arXiv preprint arXiv:2505.09662 , year=

    Large Language Models Are More Persuasive Than Incentivized Human Persuaders , author=. arXiv preprint arXiv:2505.09662 , year=

  27. [27]

    Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

    Deceptive explanations by large language models lead people to change their beliefs about misinformation more often than honest explanations , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

  28. [28]

    Nature Machine Intelligence , volume=

    What large language models know and what people think they know , author=. Nature Machine Intelligence , volume=. 2025 , publisher=

  29. [29]

    Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications

    Rogiers, Alexander and Noels, Sander and Buyl, Maarten and Bie, Tijl De , year=. Persuasion with Large Language Models: a Survey , url=. doi:10.48550/arXiv.2411.06837 , note=

  30. [30]

    Information delivered by a chatbot has a positive impact on

    Altay, Sacha and Hacquin, Anne-Sophie and Chevallier, Coralie and Mercier, Hugo , journal=. Information delivered by a chatbot has a positive impact on. 2023 , publisher=

  31. [31]

    Durably reducing conspiracy beliefs through dialogues with

    Costello, Thomas H and Pennycook, Gordon and Rand, David G , journal=. Durably reducing conspiracy beliefs through dialogues with. 2024 , publisher=

  32. [32]

    2024 , url =

    Esin Durmus and Liane Lovitt and Alex Tamkin and Stuart Ritchie and Jack Clark and Deep Ganguli , title =. 2024 , url =

  33. [33]

    Proceedings of the International AAAI Conference on Web and Social Media , volume=

    The persuasive power of large language models , author=. Proceedings of the International AAAI Conference on Web and Social Media , volume=

  34. [34]

    How good are

    Meguellati, Elyas and Han, Lei and Bernstein, Abraham and Sadiq, Shazia and Demartini, Gianluca , booktitle=. How good are

  35. [35]

    The potential of generative

    Matz, Sandra C and Teeny, Jacob D and Vaid, Sumer S and Peters, Heinrich and Harari, Gabriella M and Cerf, Moran , journal=. The potential of generative. 2024 , publisher=

  36. [36]

    PNAS nexus , volume=

    The persuasive effects of political microtargeting in the age of generative artificial intelligence , author=. PNAS nexus , volume=. 2024 , publisher=

  37. [37]

    People devalue generative

    B. People devalue generative. Communications Psychology , volume=. 2023 , publisher=

  38. [38]

    Working with

    Karinshak, Elise and Liu, Sunny Xun and Park, Joon Sung and Hancock, Jeffrey T , journal=. Working with. 2023 , publisher=

  39. [39]

    arXiv preprint arXiv:2505.07775 , year=

    Must Read: A Systematic Survey of Computational Persuasion , author=. arXiv preprint arXiv:2505.07775 , year=

  40. [40]

    Tell me a story!

    Martens, David and Hinns, James and Dams, Camille and Vergouwen, Mark and Evgeniou, Theodoros , journal=. Tell me a story!. 2025 , publisher=

  41. [41]

    Explingo: Explaining

    Zytek, Alexandra and Pido, Sara and Alnegheimish, Sarah and Berti-Equille, Laure and Veeramachaneni, Kalyan , booktitle=. Explingo: Explaining. 2024 , organization=

  42. [42]

    Hartmann, Mareike and Du, Han and Feldhus, Nils and Kruijff-Korbayov. KI-K. 2022 , publisher=

  43. [43]

    2024 , eprint=

    XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models , author=. 2024 , eprint=

  44. [44]

    On the conversational persuasiveness of GPT-4

    On the conversational persuasiveness of. Nature Human Behaviour , author =. 2025 , keywords =. doi:10.1038/s41562-025-02194-6 , language =

  45. [45]

    Proceedings of the National Academy of Sciences , volume=

    Evaluating the persuasive influence of political microtargeting with large language models , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  46. [46]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    Zero-shot persuasive chatbots with LLM-generated strategies and information retrieval , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  47. [47]

    Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

    Escalation risks from language models in military and diplomatic decision-making , author=. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

  48. [48]

    Proceedings of the 56th Annual ACM Symposium on Theory of Computing , pages=

    Calibrated language models must hallucinate , author=. Proceedings of the 56th Annual ACM Symposium on Theory of Computing , pages=

  49. [49]

    To rely or not to rely?

    Bo, Jessica Y and Wan, Sophia and Anderson, Ashton , booktitle=. To rely or not to rely?

  50. [50]

    2025 , publisher=

    Bai, Hui and Voelkel, Jan G and Muldowney, Shane and Eichstaedt, Johannes C and Willer, Robb , journal=. 2025 , publisher=

  51. [51]

    Steven Loria , year=

  52. [52]

    textstat 0.7.12 , howpublished =

    Shivam Bansal, Chaitanya Aggarwal , year=. textstat 0.7.12 , howpublished =

  53. [53]

    Computational intelligence , volume=

    Crowdsourcing a word--emotion association lexicon , author=. Computational intelligence , volume=. 2013 , publisher=

  54. [54]

    2024 , eprint=

    Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language , author=. 2024 , eprint=

  55. [55]

    https://aclanthology.org/2025.naacl-long.506/

    Measuring and benchmarking large language models’ capabilities to generate persuasive language , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , url = "https://aclanthology.org/2025.naacl-long.506/", pages=

  56. [56]

    2009 , publisher=

    Natural Language Processing with Python , author=. 2009 , publisher=

  57. [57]

    Hutto, Clayton and Gilbert, Eric , booktitle=

  58. [58]

    Hidden persuaders:

    Potter, Yujin and Lai, Shiyang and Kim, Junsol and Evans, James and Song, Dawn , booktitle=. Hidden persuaders: