arxiv: 2602.04003 · v3 · pith:RL5PK6SHnew · submitted 2026-02-03 · 💻 cs.AI

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

Pith reviewed 2026-05-21 13:18 UTC · model grok-4.3

classification 💻 cs.AI
keywords adversarial explanations · human trust in AI · LLM explanations · trust miscalibration · explanation framing · AI decision making · persuasion attacks · cognitive security

The pith

Adversarial explanation attacks preserve nearly all user trust in incorrect AI outputs by manipulating explanation framing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how attackers can change the presentation of AI explanations to maintain human trust even when the AI's recommendation is wrong. This matters because many decisions now involve following AI advice, and fluent explanations from language models can shape that trust. The authors define adversarial explanation attacks as manipulations across four framing dimensions: reasoning mode, evidence type, communication style, and presentation format. In a human study with over 200 participants, they measured trust and found it nearly identical for adversarial and benign explanations, even though the predictions behind the adversarial explanations were incorrect. The preservation effect is strongest when explanations sound like expert communication on difficult tasks.

Core claim

The authors introduce adversarial explanation attacks that manipulate the framing of LLM-generated explanations to minimize the trust miscalibration gap. Human studies show users report nearly identical trust for adversarial and benign explanations, preserving the vast majority of trust despite incorrect outputs, with highest vulnerability when explanations combine authoritative evidence, neutral tone, and domain-appropriate reasoning on hard tasks in fact-driven domains.
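
This page does not reproduce the paper's formal definition of the trust miscalibration gap. One plausible reading of "the difference in human trust between benign and adversarial explanations" is sketched below; the symbols, the sign convention, and the preservation ratio are assumptions made for illustration, not the paper's notation.

```latex
% Assumed notation: \bar{T}_{\mathrm{benign}} and \bar{T}_{\mathrm{adv}} are the mean
% self-reported trust scores under benign and adversarial explanations.
\Delta_T \;=\; \bar{T}_{\mathrm{benign}} - \bar{T}_{\mathrm{adv}},
\qquad
\rho \;=\; \frac{\bar{T}_{\mathrm{adv}}}{\bar{T}_{\mathrm{benign}}}
```

Under this reading, an attack that "minimizes the gap" pushes the difference toward 0, equivalently the ratio toward 1, which is what "preserving the vast majority of trust" would look like numerically.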

What carries the argument

Adversarial explanation attacks that vary four dimensions of explanation framing (reasoning mode, evidence type, communication style, presentation format) to modulate human trust while keeping the incorrect prediction fixed.
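
The four framing dimensions can be pictured as a small configuration record in which one dimension is changed at a time while the prediction stays fixed. The sketch below is illustrative only: the class, field, and function names are hypothetical, the baseline codes (N, IC, NE, PV) are the ones named in the Figure 4 caption on this page, and the label `citation_stat_pack` is an assumed spelling of the "Citation & Stat-Pack" evidence type.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class FramingStrategy:
    """Hypothetical record of one explanation framing (field names are assumptions)."""
    reasoning_mode: str = "N"        # baseline: Neutral
    evidence_type: str = "IC"        # baseline: Internal Conceptual
    communication_style: str = "NE"  # baseline: Neutral
    presentation_format: str = "PV"  # baseline: Plain Verbal


def vary_one_dimension(base: FramingStrategy, dimension: str, value: str) -> FramingStrategy:
    """Return a copy of `base` with exactly one framing dimension changed."""
    valid = {f.name for f in fields(FramingStrategy)}
    if dimension not in valid:
        raise ValueError(f"unknown framing dimension: {dimension}")
    return replace(base, **{dimension: value})


def changed_dimensions(a: FramingStrategy, b: FramingStrategy) -> List[str]:
    """List the dimensions on which two framings differ."""
    return [f.name for f in fields(FramingStrategy)
            if getattr(a, f.name) != getattr(b, f.name)]


baseline = FramingStrategy()
variant = vary_one_dimension(baseline, "evidence_type", "citation_stat_pack")
print(changed_dimensions(baseline, variant))
```

Keeping the record immutable makes it easy to confirm that a variant differs from the baseline on one dimension only.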

If this is right

  • Trust stays high for incorrect outputs when explanations closely resemble expert communication styles.
  • Vulnerability to these attacks rises on hard tasks and in fact-driven domains.
  • Users with less formal education, younger age, or higher initial trust in AI show greater susceptibility.
  • The combination of authoritative evidence, neutral tone, and appropriate reasoning maximizes trust preservation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI systems could incorporate checks that flag explanations with unusually persuasive framing patterns and prompt users to review the raw prediction.
  • Training users to recognize shifts in evidence type or tone might reduce the effectiveness of such attacks in real decision settings.
  • The same framing manipulations could influence trust in other automated decision tools that generate natural-language justifications.

Load-bearing premise

The four dimensions of explanation framing can be systematically varied in a controlled way that isolates their effect on trust without confounding factors from task content or participant expectations.
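
Controlled variation of this kind is often handled by counterbalancing, and the simulated rebuttal further down this page mentions a within-subjects Latin-square randomization. The paper's actual protocol is not reproduced here, so the following is only a minimal sketch of a cyclic Latin square over the four dimensions; the function names and the participant-to-row rule are assumptions.

```python
from typing import List, Sequence

DIMENSIONS = ["reasoning_mode", "evidence_type",
              "communication_style", "presentation_format"]


def cyclic_latin_square(items: Sequence[str]) -> List[List[str]]:
    """Build an n x n square in which each item occurs once per row and once per column."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def order_for_participant(participant_index: int,
                          items: Sequence[str] = DIMENSIONS) -> List[str]:
    """Assign a presentation order by cycling participants through the rows (assumed rule)."""
    square = cyclic_latin_square(items)
    return square[participant_index % len(items)]


for pid in range(4):
    print(pid, order_for_participant(pid))
```

A cyclic square balances the position in which each dimension appears, but it does not balance carry-over between adjacent conditions; a design that also controls for order effects would need something stricter.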

What would settle it

A replication study using the same tasks and participant pool that measures trust ratings and finds a drop of more than twenty percent in trust for adversarial explanations compared to benign ones would falsify the preservation claim.
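
The criterion above can be written as a simple check. This sketch assumes that "a drop of more than twenty percent" means a relative drop in mean trust from the benign to the adversarial condition; that interpretation, the function names, and the example scores (invented, not study data) are all assumptions.

```python
from statistics import mean
from typing import Sequence


def relative_trust_drop(benign: Sequence[float], adversarial: Sequence[float]) -> float:
    """Relative drop in mean trust from the benign to the adversarial condition."""
    benign_mean = mean(benign)
    if benign_mean <= 0:
        raise ValueError("benign mean trust must be positive")
    return (benign_mean - mean(adversarial)) / benign_mean


def preservation_claim_falsified(benign: Sequence[float],
                                 adversarial: Sequence[float],
                                 threshold: float = 0.20) -> bool:
    """True if the drop exceeds the threshold, i.e. trust was not preserved."""
    return relative_trust_drop(benign, adversarial) > threshold


# Invented ratings on a hypothetical 1-7 scale, for illustration only.
benign_scores = [6, 5, 6, 5]
adversarial_scores = [5, 5, 6, 5]
print(preservation_claim_falsified(benign_scores, adversarial_scores))
```

A real replication would pair this with an inferential test rather than rely on a point estimate alone.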

Figures

Figures reproduced from arXiv: 2602.04003 by Lan Zhang, Shutong Fan, Xiaoyong Yuan.

Figure 1: Overview of the adversarial explanation generation and control, consisting of four stages: construct prompt …
Figure 2: Proportion of trust cognitive sources under …
Figure 3: … and OLS regression confirms that the difference is not statistically significant (β = 0.06, p = .282; …
Figure 4: Distribution of user trust scores T across task domains and explanation strategies, including reasoning mode, evidence type, communication style, and presentation format. Baseline strategies in each dimension are marked with an asterisk (*): N (Neutral) for reasoning mode, IC (Internal Conceptual) for evidence type, NE (Neutral) for communication style, and PV (Plain Verbal) for presentation format. Reason…
Figure 5: Distribution of trust scores T across task difficulties under attacks and non-attacks.
Task Domain. Task domain also moderates user trust. As shown in …
Figure 6: Distribution of trust scores T across task domains under attacks and non-attacks.
Finding. Task context matters for trust under adversarial explanations: users exhibit higher trust on hard problems and in fact-driven domains, where they are more likely to defer to explanations framed as authoritative or supported by statistical evidence.
6.2.3 User-Level Traits: Cognitive and Demographic Traits. We analy…
Figure 8: Distribution of trust scores T across ages under attacks and non-attacks.
Initial Trust in AI. Participants with higher initial (pre-survey) trust consistently report higher trust scores T̄ across both adversarial and benign conditions (attack: 4.6, non-attack: 5.78, …
Figure 9: Distribution of trust scores T across initial trust levels under attacks and non-attacks.
6.2.4 Familiarity with AI. We additionally examined how users’ familiarity with AI moderates their trust in adversarial explanations. While descriptive trends suggest that less experienced users tend to retain higher trust under attack, statistical comparisons within the extreme groups: not familiar at all and expert-l…
Figure 10: Mean trust T̄ over a sequence of tasks.
[Axis labels from the adjacent plot: Length of prior streak (1 to 20) on the x-axis, Trust Score on the y-axis; series: attack, non-attack.]
Figure 11: Mean trust T̄ in the subsequent task after varying lengths of detected-attack or non-attack streaks.
…ial explanations, 10.1% shifted from “somewhat trust” to “neutral.” By contrast, a smaller fraction of users increased trust, with 10.9% shifting from “neutral” to “somewhat trust”, possibly because some adversarial explanations appeared credible or aligned with user expectations.
[Scale labels from the adjacent plot: Strongly distrust, Somewhat…]
Figure 12: User overall trust shift in AI before and after …
Figure 13: Prompt used for strategy-guided explanation generation.
Figure 14: Prompt used for explanation validation.
  • Feature Attribution: Identifies the most influential input features that contribute to the model’s decision and explains why the chosen answer uniquely satisfies key criteria [1, 29].
  • Analogy & Example: Justifies the provided answer by drawing a structural parallel to a real-world example or scenario [34, 57].
  • Procedural Reasoning: Presents a step-by-step, ru…
Figure 15: A sample task in the survey.
B.2 Evidence Type. Evidence type refers to the form of justification provided to support the explanation. We summarize the three evidence types as follows:
  • Citation & Stat-Pack: Attributes claims to verifiable external sources or quantitative data summaries to enhance credibility and perceived trustworthiness [19].
  • Equation & Proof: Constructs formal mathematical derivatio…
Figure 16: Distribution of trust scores T across cognitive sources under attacks and non-attacks.
To further examine how cognitive source and condition interact to shape trust, we fit an ordinary least squares (OLS) regression model predicting trust scores T from condition (attack vs. non-attack), cognitive source (explanation, prior knowledge, trust in AI, other), and their interaction. The baseline is explanation…
Figure 17: Less-experienced users retain high trust even when explanations are adversarial, while expert users exhibit lower trust and stronger discernment. Distribution of trust scores T across AI familiarity levels under attacks and non-attacks.
… observed between attack and non-attack in …
Figure 18: Mean trust by cognitive sources (top) and …
Figure 19: Mean trust by cognitive sources (top) and pro…
Figure 20: Mean trust by cognitive sources (top) and …
Figure 21: Mean trust by cognitive sources (top) and …

Original abstract

Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces adversarial explanation attacks (AEAs), in which the framing of LLM-generated explanations is manipulated to preserve human trust in incorrect AI predictions. It defines a trust miscalibration gap metric and reports a human-subject study with over 200 participants that systematically varies four framing dimensions (reasoning mode, evidence type, communication style, presentation format). The central empirical claim is that participants report nearly identical trust levels for adversarial and benign explanations, with adversarial framings preserving the vast majority of benign trust; vulnerability is reported to be highest for expert-like framings, hard tasks, fact-driven domains, and among less-educated, younger, or highly AI-trusting participants.

Significance. If the reported trust-preservation effect is robust, the work identifies a previously under-examined cognitive-layer attack surface in human-AI decision loops. The empirical mapping of framing dimensions to trust miscalibration supplies concrete evidence that persuasive but incorrect explanations can undermine appropriate reliance, with direct implications for explanation design, user-interface safeguards, and regulatory guidance on AI transparency.

major comments (2)
  1. Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.
  2. Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.
minor comments (2)
  1. The term 'trust miscalibration gap' is introduced without a formal equation or precise operationalization in the abstract; a short definitional paragraph or equation would improve clarity.
  2. The abstract states 'over 200 participants' but does not specify the exact N, exclusion criteria, or power analysis; adding these numbers in the methods section would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our human-subject study. We address each major comment below and will incorporate revisions to provide the requested methodological and analytical details.

Point-by-point responses
  1. Referee: Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.

    Authors: We acknowledge that these procedural details were not sufficiently elaborated in the submitted manuscript. The study was designed with task content held constant across conditions (only framing varied), using a within-subjects Latin-square randomization of the four framing dimensions, a pre-experiment questionnaire to assess and neutralize baseline AI expectations, and pilot-tested tasks balanced for difficulty. Exact domains included medical diagnosis and financial forecasting scenarios. In the revised version we will add a dedicated subsection in §4 with this full protocol description to rule out content confounds. revision: yes

  2. Referee: Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.

    Authors: We agree that inferential statistics are necessary for rigorous interpretation. The original submission prioritized descriptive reporting of the trust-preservation effect; we will revise §5 to include paired t-tests (or mixed ANOVA) comparing trust scores, Cohen's d effect sizes, 95% confidence intervals, and Bonferroni corrections for the four framing dimensions plus demographic moderators. We will also add linear regression models controlling for task difficulty and participant covariates to validate the subgroup findings. revision: yes
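
The analyses named in the second response can be illustrated with a short sketch. It is not the authors' analysis: the function names are hypothetical, the scores are invented, and it assumes each participant contributes one benign and one adversarial trust score. It computes the paired t statistic, Cohen's d for paired data (d_z), and a Bonferroni-adjusted significance threshold using only the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_summary(benign: Sequence[float], adversarial: Sequence[float]) -> Dict[str, float]:
    """Paired t statistic and Cohen's d_z for benign-minus-adversarial trust differences."""
    if len(benign) != len(adversarial) or len(benign) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(benign, adversarial)]
    n = len(diffs)
    diff_mean = mean(diffs)
    diff_sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if diff_sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return {
        "n": n,
        "df": n - 1,
        "mean_diff": diff_mean,
        "t": diff_mean / (diff_sd / sqrt(n)),
        "cohens_dz": diff_mean / diff_sd,
    }


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Invented scores for four hypothetical participants.
summary = paired_summary([6, 5, 7, 6], [5, 5, 6, 4])
print(summary)
print(bonferroni_alpha(0.05, 4))  # one test per framing dimension (assumed)
```

Turning the t statistic into a p-value or a 95% confidence interval requires the t distribution, which the standard library does not provide; a statistics package would be needed for that step.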

Circularity Check

0 steps flagged

No significant circularity in empirical human-subject study

full rationale

The paper is an empirical human-subject study that defines the trust miscalibration gap as a metric for the difference in reported trust between benign and adversarial explanations, then reports experimental results from over 200 participants across four framing dimensions. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The central findings rest on direct participant data rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. The work is therefore self-contained as an observational study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that self-reported trust reliably captures behavioral reliance and that the chosen framing manipulations are representative of real-world LLM explanations. No free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Self-reported trust scales in a controlled online study accurately reflect real-world decision reliance on AI outputs.
    The trust miscalibration gap metric depends on this measurement assumption to quantify the effect of adversarial framing.

pith-pipeline@v0.9.0 · 5799 in / 1205 out tokens · 28502 ms · 2026-05-21T13:18:07.955346+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems

    cs.HC 2026-03 unverdicted novelty 6.0

    LLM chat systems show large differences in reference quantity and quality, but users rarely click or engage with them.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 2021

    Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 2021

  2. [2]

    Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models

    Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. arXiv preprint arXiv:2402.04614, 2024

  3. [3]

    Amazon Mechanical Turk Documentation

    Amazon Web Services, Inc. Amazon Mechanical Turk Documentation. Amazon Web Services, URL: https://docs.aws.amazon.com/AWSMechTurk/.

  5. [5]

    Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

  6. [6]

    Evaluating robustness of counterfactual explanations

    André Artelt, Valerie Vaquet, Riza Velioglu, Fabian Hinder, Johannes Brinkrolf, Malte Schilling, and Barbara Hammer. Evaluating robustness of counterfactual explanations. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI), pages 01–09. IEEE, 2021

  7. [7]

    LLMs for explainable AI: A comprehensive survey. arXiv preprint arXiv:2504.00125, 2025

    Ahsan Bilal, David Ebert, and Beiyu Lin. LLMs for explainable AI: A comprehensive survey. arXiv preprint arXiv:2504.00125, 2025

  8. [8]

    The impact of large language models on students: A randomised study of Socratic vs. non-Socratic AI and the role of step-by-step reasoning

    Andrea Blasco and Vicky Charisi. The impact of large language models on students: A randomised study of Socratic vs. non-Socratic AI and the role of step-by-step reasoning. Non-Socratic AI and the Role of Step-by-Step Reasoning, 2024

  9. [9]

    The persuasive power of large language models

    Simon Martin Breum et al. The persuasive power of large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 152–163, 2024

  10. [10]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng et al. Social sycophancy: A broader understanding of LLM sycophancy, 2025. arXiv:2505.13995

  11. [11]

    Uncertainty in XAI: Human perception and modeling approaches. Machine Learning and Knowledge Extraction, 6(2), 2024

    Teodor Chiaburu, Frank Haußer, and Felix Bießmann. Uncertainty in XAI: Human perception and modeling approaches. Machine Learning and Knowledge Extraction, 6(2), 2024

  12. [12]

    Human confidence in artificial intelligence and in themselves: The evolution and impact of confidence on adoption of AI advice. Computers in Human Behavior, 2022

    Leah Chong, Guanglu Zhang, Kosa Goucher-Lambert, Kenneth Kotovsky, and Jonathan Cagan. Human confidence in artificial intelligence and in themselves: The evolution and impact of confidence on adoption of AI advice. Computers in Human Behavior, 2022

  13. [13]

    I think I get your point, AI! The illusion of explanatory depth in explainable AI

    Michael Chromik, Malin Eiband, Felicitas Buchner, Adrian Krüger, and Andreas Butz. I think I get your point, AI! The illusion of explanatory depth in explainable AI. In Proceedings of the 26th International Conference on Intelligent User Interfaces, pages 307–317, 2021

  14. [14]

    FaithLM: Towards faithful explanations for large language models, 2024. arXiv:2402.04678

    Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, and Xia Hu. FaithLM: Towards faithful explanations for large language models, 2024. arXiv:2402.04678

  15. [15]

    Influence: The psychology of persuasion

    Robert B. Cialdini. Influence: The psychology of persuasion, volume 55. Collins New York, 2007

  16. [16]

    Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models

    Michelle Cohn et al. Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024. doi: 10.1145/3613905.3650818

  17. [17]

    An interactional account of empathy in human-machine communication. Human-Machine Communication, 6(1):6, 2023

    Shauna Concannon, Ian Roberts, and Marcus Tomalin. An interactional account of empathy in human-machine communication. Human-Machine Communication, 6(1):6, 2023

  18. [18]

    Shifting focus with HCEye: Exploring the dynamics of visual highlighting and cognitive load on user attention and saliency prediction

    Anwesha Das, Zekun Wu, Iza Skrjanec, and Anna Maria Feit. Shifting focus with HCEye: Exploring the dynamics of visual highlighting and cognitive load on user attention and saliency prediction. Proceedings of the ACM on Human-Computer Interaction, 8(ETRA):1–18, 2024

  19. [19]

    On generating trustworthy counterfactual explanations. Information Sciences, 2024

    Javier Del Ser, Alejandro Barredo-Arrieta, Natalia Díaz-Rodríguez, Francisco Herrera, Anna Saranti, and Andreas Holzinger. On generating trustworthy counterfactual explanations. Information Sciences, 2024

  20. [20]

    Citations and trust in LLM generated responses

    Yifan Ding et al. Citations and trust in LLM generated responses. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  21. [21]

    Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity

    Zijian Ding, Arvind Srinivasan, Stephen MacNeil, and Joel Chan. Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In Proceedings of the 15th Conference on Creativity and Cognition, pages 489–505, 2023

  22. [22]

    Secure human oversight of AI: Exploring the attack surface of human oversight

    Jonas C Ditz et al. Secure human oversight of AI: Exploring the attack surface of human oversight. arXiv preprint arXiv:2509.12290, 2025

  23. [23]

    Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 2022

    Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 2022

  24. [24]

    Evidence-based XAI: An empirical approach to design more effective and explainable decision support systems. Computers in Biology and Medicine, 170(March 2024), 2024

    Lorenzo Famiglini et al. Evidence-based XAI: An empirical approach to design more effective and explainable decision support systems. Computers in Biology and Medicine, 170(March 2024), 2024

  25. [25]

    Position: Human factors reshape adversarial analysis in human-AI decision-making systems. arXiv preprint arXiv:2509.21436, 2025

    Shutong Fan, Lan Zhang, and Xiaoyong Yuan. Position: Human factors reshape adversarial analysis in human-AI decision-making systems. arXiv preprint arXiv:2509.21436, 2025

  26. [26]

    On the creativity of large language models. AI & SOCIETY, pages 1–11, 2024

    Giorgio Franceschelli and Mirco Musolesi. On the creativity of large language models. AI & SOCIETY, pages 1–11, 2024

  27. [27]

    Model inversion attacks that exploit confidence information and basic countermeasures

    Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015

  28. [28]

    Learning complementary policies for human-AI teams. arXiv preprint arXiv:2302.02944, 2023

    Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga, Ligong Han, Wei Sun, Min Kyung Lee, and Matthew Lease. Learning complementary policies for human-AI teams. arXiv preprint arXiv:2302.02944, 2023

  29. [29]

    Explaining and harnessing adversarial examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015

  30. [30]

    A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 2018

    Riccardo Guidotti et al. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 2018

  31. [31]

    A meta-analysis of factors affecting trust in human-robot interaction. Human Factors, 2011

    Peter A Hancock et al. A meta-analysis of factors affecting trust in human-robot interaction. Human Factors, 2011

  32. [32]

    LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models

    Shibo Hao et al. LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. In First Conference on Language Modeling, 2024

  33. [33]

    Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks et al. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  34. [34]

    Citation: A key to building responsible and accountable large language models. arXiv preprint arXiv:2307.02185, 2023

    Jie Huang and Kevin Chen-Chuan Chang. Citation: A key to building responsible and accountable large language models. arXiv preprint arXiv:2307.02185, 2023

  35. [35]

    Towards analogy-based explanations in machine learning

    Eyke Hüllermeier. Towards analogy-based explanations in machine learning. In International Conference on Modeling Decisions for Artificial Intelligence. Springer, 2020

  36. [36]

    GPT-4o System Card

    Aaron Hurst et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [37]

    Towards interactive evaluations for interaction harms in human-AI systems

    Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-AI systems. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025

  38. [38]

    The effects of emotions on trust in human-computer interaction: A survey and prospect. International Journal of Human–Computer Interaction, 2024

    Myounghoon Jeon. The effects of emotions on trust in human-computer interaction: A survey and prospect. International Journal of Human–Computer Interaction, 2024

  39. [39]

    Constrained highlighting in a document reader can improve reading comprehension

    Nikhita Joshi and Daniel Vogel. Constrained highlighting in a document reader can improve reading comprehension. In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024

  40. [40]

    Frames, framing and reframing. Beyond Intractability, 1:1–8, 2003

    Sanda Kaufman, Michael Elliott, and Deborah Shmueli. Frames, framing and reframing. Beyond Intractability, 1:1–8, 2003

  41. [41]

    Artificial intelligence and the ongoing need for empathy, compassion and trust in healthcare. Bulletin of the World Health Organization, 98(4):245, 2020

    Angeliki Kerasidou. Artificial intelligence and the ongoing need for empathy, compassion and trust in healthcare. Bulletin of the World Health Organization, 98(4):245, 2020

  42. [42]

    "How do I fool you?": Manipulating user trust via misleading black box explanations

    Himabindu Lakkaraju and Osbert Bastani. "How do I fool you?": Manipulating user trust via misleading black box explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 79–85, 2020

  43. [43]

    Polite speech strategies and their impact on drivers’ trust in autonomous vehicles. Computers in Human Behavior, 127:107015, 2022

    Jae-gil Lee and Kwan Min Lee. Polite speech strategies and their impact on drivers’ trust in autonomous vehicles. Computers in Human Behavior, 127:107015, 2022

  44. [44]

    Trust in automation: Designing for appropriate reliance. Human factors, 46(1), 2004

    John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human factors, 46(1), 2004

  45. [45]

    Towards uncertainty aware task delegation and human-ai collaborative decision-making

    Min Hun Lee and Martyn Zhe Yu Tok. Towards uncertainty aware task delegation and human-ai collaborative decision-making. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2025

  46. [46]

    Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  47. [47]

    Questioning the ai: Informing design practices for explainable ai user experiences

    Q. Vera Liao et al. Questioning the ai: Informing design practices for explainable ai user experiences. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022

  48. [48]

    Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020

    Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020

  49. [49]

    Strategic adversarial attacks in ai-assisted decision making to reduce human trust and reliance

    Zhuoran Lu, Zhuoyan Li, Chun-Wei Chiang, and Ming Yin. Strategic adversarial attacks in ai-assisted decision making to reduce human trust and reliance. In IJCAI, pages 3020–3028, 2023

  50. [50]

    Are self-explanations from large language models faithful?

    Andreas Madsen, Sarath Chandar, and Siva Reddy. Are self-explanations from large language models faithful? arXiv preprint arXiv:2401.07927, 2024

  51. [51]

    Sycophancy in large language models: Causes and mitigations

    Lars Malmqvist. Sycophancy in large language models: Causes and mitigations. In Intelligent Computing-Proceedings of the Computing Conference, pages 61–74. Springer, 2025

  52. [52]

    Walk the talk? measuring the faithfulness of large language model explanations

    Katie Matton, Robert Ness, John Guttag, and Emre Kiciman. Walk the talk? measuring the faithfulness of large language model explanations. In The Thirteenth International Conference on Learning Representations, 2025

  53. [53]

    Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267, 2019

    Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267, 2019

  54. [54]

    The trouble with overconfidence. Psychological review, 115(2):502, 2008

    Don A Moore and Paul J Healy. The trouble with overconfidence. Psychological review, 115(2):502, 2008

  55. [55]

    Explaining machine learning classifiers through diverse counterfactual explanations

    Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, page 607–617, 2020. doi:10.1145/3351095.3372850

  56. [56]

    Llms for science: Usage for code generation and data analysis. Journal of Software: Evolution and Process, 37(1), 2025

    Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, and Ingo Weber. Llms for science: Usage for code generation and data analysis. Journal of Software: Evolution and Process, 37(1), 2025

  57. [57]

    The elaboration likelihood model of persuasion

    Richard E Petty and John T Cacioppo. The elaboration likelihood model of persuasion. In Advances in experimental social psychology, volume 19, pages 123–205. Elsevier, 1986

  58. [58]

    Natural example-based explainability: a survey

    Antonin Poché, Lucas Hervier, and Mohamed-Chafik Bakkay. Natural example-based explainability: a survey. In World Conference on eXplainable Artificial Intelligence, pages 24–47. Springer, 2023

  59. [59]

    The effect of framing on trust in artificial intelligence: An analysis of acceptance behavior. Available at SSRN 5008348, 2024

    Sonja Gabriele Prinz, Barbara E Weißenberger, and Peter Kotzian. The effect of framing on trust in artificial intelligence: An analysis of acceptance behavior. Available at SSRN 5008348, 2024

  60. [60]

    Qualtrics survey platform, 2025

    Qualtrics. Qualtrics survey platform, 2025. URL: https://www.qualtrics.com/

  61. [61]

    Towards human-centered explainable ai: A survey of user studies for model explanations. IEEE transactions on pattern analysis and machine intelligence, 46(4):2104–2122, 2023

    Yao Rong et al. Towards human-centered explainable ai: A survey of user studies for model explanations. IEEE transactions on pattern analysis and machine intelligence, 46(4):2104–2122, 2023

  62. [62]

    Talk, listen, connect: How humans and ai evaluate empathy in responses to emotionally charged narratives, 2025

    Mahnaz Roshanaei, Rezvaneh Rezapour, and Magy Seif El-Nasr. Talk, listen, connect: How humans and ai evaluate empathy in responses to emotionally charged narratives, 2025. arXiv: 2409.15550

  63. [63]

    A missing piece in the puzzle: Considering the role of task complexity in human-ai decision making

    Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. A missing piece in the puzzle: Considering the role of task complexity in human-ai decision making. In Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization, 2023

  64. [64]

    On the conversational persuasiveness of GPT-4

    Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West. On the conversational persuasiveness of GPT-4. Nature Human Behaviour, 9(8):1645–1653, May 2025. doi:10.1038/s41562-025-02194-6

  65. [65]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, et al. Towards understanding sycophancy in language models. In The International Conference on Learning Representations, 2024

  66. [66]

    On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36:61836–61856, 2023

    Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36:61836–61856, 2023

  67. [67]

    The illusion of competence: Evaluating the effect of explanations on users’ mental models of visual question answering systems

    Judith Sieker, Simeon Junker, Ronja Utescher, Nazia Attari, Heiko Wersing, Hendrik Buschmeier, and Sina Zarrieß. The illusion of competence: Evaluating the effect of explanations on users’ mental models of visual question answering systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, November 2024. doi:10.18653...

  68. [68]

    Toward expert-level medical question answering with large language models

    Karan Singhal et al. Toward expert-level medical question answering with large language models. Nature Medicine, 2025

  69. [69]

    Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in neural information processing systems, 2021

    Dylan Slack, Anna Hilgard, Sameer Singh, and Himabindu Lakkaraju. Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in neural information processing systems, 2021

  70. [70]

    What large language models know and what people think they know

    Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025

  71. [71]

    The effect of highlighting on cognitive load and visual attention in multimedia learning. International Journal of Human–Computer Interaction, 2025

    Yuzhi Sun and David A Nembhard. The effect of highlighting on cognitive load and visual attention in multimedia learning. International Journal of Human–Computer Interaction, 2025

  72. [72]

    Intriguing properties of neural networks

    Christian Szegedy et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2014

  73. [73]

    ProofWriter: Generating implications, proofs, and abductive statements over natural language

    Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, August 2021. doi:10.18653/v1/2021.findings-acl.317

  74. [74]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2023

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2023

  75. [75]

    Show or suppress? managing input uncertainty in machine learning model explanations. Artificial Intelligence, 294:103456, 2021

    Danding Wang, Wencan Zhang, and Brian Y Lim. Show or suppress? managing input uncertainty in machine learning model explanations. Artificial Intelligence, 294:103456, 2021

  76. [76]

    When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087, 2025

    Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087, 2025

  77. [77]

    Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 2022

    Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 2022

  78. [78]

    Naturalprover: Grounded mathematical proof generation with language models

    Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. Naturalprover: Grounded mathematical proof generation with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  79. [79]

    Understanding and supporting peer review using ai-reframed positive summary

    Chi-Lan Yang, Alarith Uhde, Naomi Yamashita, and Hideaki Kuzuoka. Understanding and supporting peer review using ai-reframed positive summary. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2025

  80. [80]

    Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023

    Kaiyu Yang et al. Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023

Showing first 80 references.