When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

Lan Zhang; Shutong Fan; Xiaoyong Yuan

arxiv: 2602.04003 · v3 · pith:RL5PK6SHnew · submitted 2026-02-03 · 💻 cs.AI

Pith reviewed 2026-05-21 13:18 UTC · model grok-4.3

keywords: adversarial explanations, human trust in AI, LLM explanations, trust miscalibration, explanation framing, AI decision making, persuasion attacks, cognitive security

The pith

Adversarial explanation attacks preserve nearly all user trust in incorrect AI outputs by manipulating explanation framing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how attackers can change the presentation of AI explanations to maintain human trust even when the AI's recommendation is wrong. This matters because many decisions now involve following AI advice, and fluent explanations from language models can shape that trust. The authors define adversarial explanation attacks as manipulations across four framing aspects: reasoning structure, evidence type, communication tone, and visual format. In experiments with over 200 participants, they measured trust levels and found them nearly identical for crafted explanations versus straightforward ones, even though the underlying predictions were incorrect. The preservation effect is strongest when explanations sound like expert communication on difficult tasks.

Core claim

The authors introduce adversarial explanation attacks that manipulate the framing of LLM-generated explanations to minimize the trust miscalibration gap. Human studies show users report nearly identical trust for adversarial and benign explanations, preserving the vast majority of trust despite incorrect outputs, with highest vulnerability when explanations combine authoritative evidence, neutral tone, and domain-appropriate reasoning on hard tasks in fact-driven domains.

What carries the argument

Adversarial explanation attacks that vary four dimensions of explanation framing (reasoning mode, evidence type, communication style, presentation format) to modulate human trust while keeping the incorrect prediction fixed.

If this is right

Trust stays high for incorrect outputs when explanations closely resemble expert communication styles.
Vulnerability to these attacks rises on hard tasks and in fact-driven domains.
Users with less formal education, younger age, or higher initial trust in AI show greater susceptibility.
The combination of authoritative evidence, neutral tone, and appropriate reasoning maximizes trust preservation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

AI systems could incorporate checks that flag explanations with unusually persuasive framing patterns and prompt users to review the raw prediction.
Training users to recognize shifts in evidence type or tone might reduce the effectiveness of such attacks in real decision settings.
The same framing manipulations could influence trust in other automated decision tools that generate natural-language justifications.

Load-bearing premise

The four dimensions of explanation framing can be systematically varied in a controlled way that isolates their effect on trust without confounding factors from task content or participant expectations.

What would settle it

A replication study using the same tasks and participant pool that measures trust ratings and finds a drop of more than twenty percent in trust for adversarial explanations compared to benign ones would falsify the preservation claim.
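The falsification criterion above can be made concrete with a short sketch. It assumes that "a drop of more than twenty percent" means a relative drop in mean trust, (benign − adversarial) / benign; the paper summary does not spell this out. The function names and the sample ratings (on an assumed 1–7 scale) are hypothetical and are not data from the study.

```python
from statistics import mean


def relative_trust_drop(benign, adversarial):
    """Relative drop in mean trust when moving from benign to adversarial explanations."""
    benign_mean = mean(benign)
    return (benign_mean - mean(adversarial)) / benign_mean


def falsifies_preservation(benign, adversarial, threshold=0.20):
    """True if the drop exceeds the threshold (assumed reading of the 20% criterion)."""
    return relative_trust_drop(benign, adversarial) > threshold


# Hypothetical ratings for illustration only.
benign_ratings = [6, 5, 6, 7, 5, 6]
adversarial_ratings = [6, 5, 5, 6, 5, 6]

drop = relative_trust_drop(benign_ratings, adversarial_ratings)
verdict = falsifies_preservation(benign_ratings, adversarial_ratings)
```

With these made-up ratings the relative drop is well under the threshold, so the preservation claim would stand; a replication would need real ratings from the same tasks and participant pool.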

Figures

Figures reproduced from arXiv: 2602.04003 by Lan Zhang, Shutong Fan, Xiaoyong Yuan.

**Figure 1.** Overview of the adversarial explanation generation and control, consisting of four stages: construct prompt…

**Figure 2.** Proportion of trust cognitive sources under…

**Figure 3.** [Caption not recovered; adjoining text:] …and OLS regression confirms that the difference is not statistically significant (β = 0.06, p = .282; …

**Figure 4.** Distribution of user trust scores T across task domains and explanation strategies, including reasoning mode, evidence type, communication style, and presentation format. Baseline strategies in each dimension are marked with an asterisk (*): N (Neutral) for reasoning mode, IC (Internal Conceptual) for evidence type, NE (Neutral) for communication style, and PV (Plain Verbal) for presentation format. Reason…

**Figure 5.** Distribution of trust scores T across task difficulties under attacks and non-attacks. [Adjoining text:] Task Domain. Task domain also moderates user trust. As shown in…

**Figure 6.** Distribution of trust scores T across task domains under attacks and non-attacks. [Adjoining text:] Finding. Task context matters for trust under adversarial explanations: users exhibit higher trust on hard problems and in fact-driven domains, where they are more likely to defer to explanations framed as authoritative or supported by statistical evidence. 6.2.3 User-Level Traits: Cognitive and Demographic Traits. We analy…

**Figure 8.** Distribution of trust scores T across ages under attacks and non-attacks. [Adjoining text:] Initial Trust in AI. Participants with higher initial (presurvey) trust consistently report higher trust scores T̄ across both adversarial and benign conditions (attack: 4.6, non-attack: 5.78, …

**Figure 9.** Distribution of trust scores T across initial trust levels under attacks and non-attacks. [Adjoining text:] 6.2.4 Familiarity with AI. We additionally examined how users’ familiarity with AI moderates their trust in adversarial explanations. While descriptive trends suggest that less experienced users tend to retain higher trust under attack, statistical comparisons within the extreme groups: not familiar at all and expertl…

**Figure 10.** Mean trust T̄ over a sequence of tasks. [Adjoining plot labels: x-axis “Length of prior streak” (1–20); y-axis “Trust Score”; series: attack, non-attack.]

**Figure 11.** Mean trust T̄ in the subsequent task after varying lengths of detected-attack or non-attack streaks. [Adjoining text:] …adversarial explanations, 10.1% shifted from “somewhat trust” to “neutral.” By contrast, a smaller fraction of users increased trust, with 10.9% shifting from “neutral” to “somewhat trust”, possibly because some adversarial explanations appeared credible or aligned with user expectations. [Adjoining plot labels: Strongly distrust, Somewhat…]

**Figure 12.** User overall trust shift in AI before and after…

**Figure 13.** Prompt used for strategy-guided explanation generation.

**Figure 14.** Prompt used for explanation validation. [Adjoining text:]
• Feature Attribution: Identifies the most influential input features that contribute to the model’s decision and explains why the chosen answer uniquely satisfies key criteria [1, 29].
• Analogy & Example: Justifies the provided answer by drawing a structural parallel to a real-world example or scenario [34, 57].
• Procedural Reasoning: Presents a step-by-step, ru…

**Figure 15.** A sample task in the survey. [Adjoining text:] B.2 Evidence Type. Evidence type refers to the form of justification provided to support the explanation. We summarize the three evidence types as follows:
• Citation & Stat-Pack: Attributes claims to verifiable external sources or quantitative data summaries to enhance credibility and perceived trustworthiness [19].
• Equation & Proof: Constructs formal mathematical derivatio…

**Figure 16.** Distribution of trust scores T across cognitive sources under attacks and non-attacks. [Adjoining text:] To further examine how cognitive source and condition interact to shape trust, we fit an ordinary least squares (OLS) regression model predicting trust scores T from condition (attack vs. non-attack), cognitive source (explanation, prior knowledge, trust in AI, other), and their interaction. The baseline is explanation…

**Figure 17.** Less-experienced users retain high trust even when explanations are adversarial, while expert users exhibit lower trust and stronger discernment. Distribution of trust scores T across AI familiarity levels under attacks and non-attacks. [Adjoining text:] …observed between attack and non-attack in…

**Figure 18.** Mean trust by cognitive sources (top) and…

**Figure 19.** Mean trust by cognitive sources (top) and pro…

**Figure 20.** Mean trust by cognitive sources (top) and…

**Figure 21.** Mean trust by cognitive sources (top) and…

Original abstract

Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.
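The trust miscalibration gap named in the abstract can be estimated directly from survey ratings. The sketch below is not the authors' code. It follows the definition quoted elsewhere on this page, ΔT(q,s) = E_u[T(u,q,e_A(q,s))] − E_u[T(u,q,e_B(q))], and replaces each expectation over users with a sample mean for a single question q. The function name, the strategy labels, and the ratings are hypothetical.

```python
from statistics import mean


def trust_gaps(benign_ratings, adversarial_ratings_by_strategy):
    """Estimate the gap for one question q and every framing strategy s.

    benign_ratings: trust ratings T(u, q, e_B(q)) from users shown the benign explanation.
    adversarial_ratings_by_strategy: maps a strategy label s to the ratings
        T(u, q, e_A(q, s)) from users shown the adversarial explanation.
    Returns a dict mapping s to mean adversarial trust minus mean benign trust.
    """
    benign_mean = mean(benign_ratings)
    return {
        strategy: mean(ratings) - benign_mean
        for strategy, ratings in adversarial_ratings_by_strategy.items()
    }


# Hypothetical ratings for illustration only (assumed 1-7 scale).
gaps = trust_gaps(
    benign_ratings=[6.0, 6.0, 5.0, 7.0],
    adversarial_ratings_by_strategy={
        "strategy_a": [6.0, 6.0, 6.0, 6.0],
        "strategy_b": [4.0, 5.0, 5.0, 4.0],
    },
)
```

Under this reading, a gap close to zero means the adversarial framing kept nearly all of the benign trust, while a strongly negative gap means users withdrew trust from the incorrect output.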

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces adversarial explanation attacks (AEAs), in which the framing of LLM-generated explanations is manipulated to preserve human trust in incorrect AI predictions. It defines a trust miscalibration gap metric and reports a human-subject study with over 200 participants that systematically varies four framing dimensions (reasoning mode, evidence type, communication style, presentation format). The central empirical claim is that participants report nearly identical trust levels for adversarial and benign explanations, with adversarial framings preserving the vast majority of benign trust; vulnerability is reported to be highest for expert-like framings, hard tasks, fact-driven domains, and among less-educated, younger, or highly AI-trusting participants.

Significance. If the reported trust-preservation effect is robust, the work identifies a previously under-examined cognitive-layer attack surface in human-AI decision loops. The empirical mapping of framing dimensions to trust miscalibration supplies concrete evidence that persuasive but incorrect explanations can undermine appropriate reliance, with direct implications for explanation design, user-interface safeguards, and regulatory guidance on AI transparency.

major comments (2)

Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.
Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.

minor comments (2)

The term 'trust miscalibration gap' is introduced without a formal equation or precise operationalization in the abstract; a short definitional paragraph or equation would improve clarity.
The abstract states 'over 200 participants' but does not specify the exact N, exclusion criteria, or power analysis; adding these numbers in the methods section would strengthen reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our human-subject study. We address each major comment below and will incorporate revisions to provide the requested methodological and analytical details.

Point-by-point responses

Referee: Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.

Authors: We acknowledge that these procedural details were not sufficiently elaborated in the submitted manuscript. The study was designed with task content held constant across conditions (only framing varied), using a within-subjects Latin-square randomization of the four framing dimensions, a pre-experiment questionnaire to assess and neutralize baseline AI expectations, and pilot-tested tasks balanced for difficulty. Exact domains included medical diagnosis and financial forecasting scenarios. In the revised version we will add a dedicated subsection in §4 with this full protocol description to rule out content confounds. revision: yes
Referee: Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.

Authors: We agree that inferential statistics are necessary for rigorous interpretation. The original submission prioritized descriptive reporting of the trust-preservation effect; we will revise §5 to include paired t-tests (or mixed ANOVA) comparing trust scores, Cohen's d effect sizes, 95% confidence intervals, and Bonferroni corrections for the four framing dimensions plus demographic moderators. We will also add linear regression models controlling for task difficulty and participant covariates to validate the subgroup findings. revision: yes
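The quantities named in this response (a paired effect size, a 95% confidence interval, and a Bonferroni-adjusted significance level) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' analysis: it assumes one benign and one adversarial rating per participant, uses Cohen's d for paired data (mean difference divided by the standard deviation of the differences), and uses a normal approximation for the interval. The function name and ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_trust_summary(benign, adversarial, n_comparisons=1, alpha=0.05):
    """Summarize paired trust ratings (benign[i] and adversarial[i] from the same user)."""
    diffs = [b - a for b, a in zip(benign, adversarial)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Sample SD of the differences; if it is zero the division below
    # raises ZeroDivisionError, so real data would need a guard.
    sd_diff = stdev(diffs)
    cohens_d = mean_diff / sd_diff
    # 95% interval for the mean difference via a normal approximation;
    # a t critical value would be more appropriate for small samples.
    z = NormalDist().inv_cdf(0.975)
    half_width = z * sd_diff / sqrt(n)
    return {
        "mean_diff": mean_diff,
        "cohens_d": cohens_d,
        "ci95": (mean_diff - half_width, mean_diff + half_width),
        # Bonferroni: divide the significance level by the number of tests.
        "bonferroni_alpha": alpha / n_comparisons,
    }


# Hypothetical ratings for illustration only (assumed 1-7 scale).
benign = [6, 5, 7, 6, 5, 6, 7, 5]
adversarial = [6, 5, 6, 6, 4, 6, 7, 5]
summary = paired_trust_summary(benign, adversarial, n_comparisons=4)
```

Here `n_comparisons=4` stands in for the four framing dimensions; adding demographic moderators to the family of tests would raise that count and lower the adjusted level further.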

Circularity Check

0 steps flagged

No significant circularity in empirical human-subject study

full rationale

The paper is an empirical human-subject study that defines the trust miscalibration gap as a metric for the difference in reported trust between benign and adversarial explanations, then reports experimental results from over 200 participants across four framing dimensions. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The central findings rest on direct participant data rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. The work is therefore self-contained as an observational study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that self-reported trust reliably captures behavioral reliance and that the chosen framing manipulations are representative of real-world LLM explanations. No free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: Self-reported trust scales in a controlled online study accurately reflect real-world decision reliance on AI outputs.
The trust miscalibration gap metric depends on this measurement assumption to quantify the effect of adversarial framing.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We define the trust miscalibration gap as the change in user trust induced by adversarial explanation relative to the benign condition: ΔT(q,s) = E_u[T(u,q,e_A(q,s))] − E_u[T(u,q,e_B(q))]."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems
cs.HC 2026-03 unverdicted novelty 6.0

LLM chat systems show large differences in reference quantity and quality, but users rarely click or engage with them.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 2021.
[2] Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. arXiv preprint arXiv:2402.04614, 2024.
[3]–[4] Amazon Web Services, Inc. Amazon Mechanical Turk Documentation. Amazon Web Services. URL: https://docs.aws.amazon.com/AWSMechTurk/.
[5] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
[6] André Artelt, Valerie Vaquet, Riza Velioglu, Fabian Hinder, Johannes Brinkrolf, Malte Schilling, and Barbara Hammer. Evaluating robustness of counterfactual explanations. In 2021 IEEE symposium series on computational intelligence (SSCI), pages 01–09. IEEE, 2021.
[7] Ahsan Bilal, David Ebert, and Beiyu Lin. LLMs for explainable AI: A comprehensive survey. arXiv preprint arXiv:2504.00125, 2025.
[8] Andrea Blasco and Vicky Charisi. The impact of large language models on students: A randomised study of Socratic vs. non-Socratic AI and the role of step-by-step reasoning. Non-Socratic AI and the Role of Step-by-Step Reasoning, 2024.
[9] Simon Martin Breum et al. The persuasive power of large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 152–163, 2024.
[10] Myra Cheng et al. Social sycophancy: A broader understanding of LLM sycophancy, 2025. arXiv:2505.13995.
[11] Teodor Chiaburu, Frank Haußer, and Felix Bießmann. Uncertainty in XAI: Human perception and modeling approaches. Machine Learning and Knowledge Extraction, 6(2), 2024.
[12] Leah Chong, Guanglu Zhang, Kosa Goucher-Lambert, Kenneth Kotovsky, and Jonathan Cagan. Human confidence in artificial intelligence and in themselves: The evolution and impact of confidence on adoption of AI advice. Computers in Human Behavior, 2022.
[13] Michael Chromik, Malin Eiband, Felicitas Buchner, Adrian Krüger, and Andreas Butz. I think I get your point, AI! The illusion of explanatory depth in explainable AI. In Proceedings of the 26th International Conference on Intelligent User Interfaces, pages 307–317, 2021.
[14] Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, and Xia Hu. FaithLM: Towards faithful explanations for large language models, 2024. arXiv:2402.04678.
[15] Robert B Cialdini. Influence: The psychology of persuasion, volume 55. Collins New York, 2007.
[16] Michelle Cohn et al. Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024. doi:10.1145/3613905.3650818.
[17] Shauna Concannon, Ian Roberts, and Marcus Tomalin. An interactional account of empathy in human-machine communication. Human-Machine Communication, 6(1):6, 2023.
[18] Anwesha Das, Zekun Wu, Iza Skrjanec, and Anna Maria Feit. Shifting focus with HCEye: Exploring the dynamics of visual highlighting and cognitive load on user attention and saliency prediction. Proceedings of the ACM on Human-Computer Interaction, 8(ETRA):1–18, 2024.
[19] Javier Del Ser, Alejandro Barredo-Arrieta, Natalia Díaz-Rodríguez, Francisco Herrera, Anna Saranti, and Andreas Holzinger. On generating trustworthy counterfactual explanations. Information Sciences, 2024.
[20] Yifan Ding et al. Citations and trust in LLM generated responses. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
[21] Zijian Ding, Arvind Srinivasan, Stephen MacNeil, and Joel Chan. Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity. In Proceedings of the 15th Conference on Creativity and Cognition, pages 489–505, 2023.
[22] Jonas C Ditz et al. Secure human oversight of AI: Exploring the attack surface of human oversight. arXiv preprint arXiv:2509.12290, 2025.
[23] Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 2022.
[24] Lorenzo Famiglini et al. Evidence-based XAI: An empirical approach to design more effective and explainable decision support systems. Computers in biology and medicine, 170(March 2024), 2024.
[25] Shutong Fan, Lan Zhang, and Xiaoyong Yuan. Position: Human factors reshape adversarial analysis in human-AI decision-making systems. arXiv preprint arXiv:2509.21436, 2025.
[26] Giorgio Franceschelli and Mirco Musolesi. On the creativity of large language models. AI & SOCIETY, pages 1–11, 2024.
[27] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, 2015.
[28] Ruijiang Gao, Maytal Saar-Tsechansky, Maria De-Arteaga, Ligong Han, Wei Sun, Min Kyung Lee, and Matthew Lease. Learning complementary policies for human-AI teams. arXiv preprint arXiv:2302.02944, 2023.
[29] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
[30] Riccardo Guidotti et al. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 2018.
[31] Peter A Hancock et al. A meta-analysis of factors affecting trust in human-robot interaction. Human factors, 2011.
[32] Shibo Hao et al. LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. In First Conference on Language Modeling, 2024.
[33] Dan Hendrycks et al. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[34] Jie Huang and Kevin Chen-Chuan Chang. Citation: A key to building responsible and accountable large language models. arXiv preprint arXiv:2307.02185, 2023.
[35] Eyke Hüllermeier. Towards analogy-based explanations in machine learning. In International Conference on Modeling Decisions for Artificial Intelligence. Springer, 2020.
[36] Aaron Hurst et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[37] Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-AI systems. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025.
[38] Myounghoon Jeon. The effects of emotions on trust in human-computer interaction: A survey and prospect. International Journal of Human–Computer Interaction, 2024.
[39]

Constrained high- lighting in a document reader can improve reading comprehension

Nikhita Joshi and Daniel V ogel. Constrained high- lighting in a document reader can improve reading comprehension. InProceedings of the CHI Con- ference on Human Factors in Computing Systems, 2024

work page 2024
[40]

Frames, framing and reframing.Be- yond intractability, 1:1–8, 2003

Sanda Kaufman, Michael Elliott, and Deborah Shmueli. Frames, framing and reframing.Be- yond intractability, 1:1–8, 2003

work page 2003
[41]

Angeliki Kerasidou. Artificial intelligence and the ongoing need for empathy, compassion and trust in healthcare. Bulletin of the World Health Organization, 98(4):245, 2020

[42]

Himabindu Lakkaraju and Osbert Bastani. "how do i fool you?" manipulating user trust via misleading black box explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 79–85, 2020

[43]

Jae-gil Lee and Kwan Min Lee. Polite speech strategies and their impact on drivers’ trust in autonomous vehicles. Computers in Human Behavior, 127:107015, 2022

[44]

John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human factors, 46(1), 2004

[45]

Min Hun Lee and Martyn Zhe Yu Tok. Towards uncertainty aware task delegation and human-ai collaborative decision-making. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2025

[46]

Patrick Lewis et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

[47]

Q. Vera Liao et al. Questioning the ai: Informing design practices for explainable ai user experiences. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022

[48]

Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020

[49]

Zhuoran Lu, Zhuoyan Li, Chun-Wei Chiang, and Ming Yin. Strategic adversarial attacks in ai-assisted decision making to reduce human trust and reliance. In IJCAI, pages 3020–3028, 2023

[50]

Andreas Madsen, Sarath Chandar, and Siva Reddy. Are self-explanations from large language models faithful? arXiv preprint arXiv:2401.07927, 2024

[51]

Lars Malmqvist. Sycophancy in large language models: Causes and mitigations. In Intelligent Computing-Proceedings of the Computing Conference, pages 61–74. Springer, 2025

[52]

Katie Matton, Robert Ness, John Guttag, and Emre Kiciman. Walk the talk? measuring the faithfulness of large language model explanations. In The Thirteenth International Conference on Learning Representations, 2025

[53]

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267, 2019

[54]

Don A Moore and Paul J Healy. The trouble with overconfidence. Psychological review, 115(2):502, 2008

[55]

Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 607–617, 2020. doi:10.1145/3351095.3372850

[56]

Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, and Ingo Weber. Llms for science: Usage for code generation and data analysis. Journal of Software: Evolution and Process, 37(1), 2025

[57]

Richard E Petty and John T Cacioppo. The elaboration likelihood model of persuasion. In Advances in experimental social psychology, volume 19, pages 123–205. Elsevier, 1986

[58]

Antonin Poché, Lucas Hervier, and Mohamed-Chafik Bakkay. Natural example-based explainability: a survey. In World Conference on eXplainable Artificial Intelligence, pages 24–47. Springer, 2023

[59]

Sonja Gabriele Prinz, Barbara E Weißenberger, and Peter Kotzian. The effect of framing on trust in artificial intelligence: An analysis of acceptance behavior. Available at SSRN 5008348, 2024

[60]

Qualtrics. Qualtrics survey platform, 2025. URL: https://www.qualtrics.com/

[61]

Yao Rong et al. Towards human-centered explainable ai: A survey of user studies for model explanations. IEEE transactions on pattern analysis and machine intelligence, 46(4):2104–2122, 2023

[62]

Mahnaz Roshanaei, Rezvaneh Rezapour, and Magy Seif El-Nasr. Talk, listen, connect: How humans and ai evaluate empathy in responses to emotionally charged narratives, 2025. arXiv:2409.15550

[63]

Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. A missing piece in the puzzle: Considering the role of task complexity in human-ai decision making. In Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization, 2023

[64]

Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West. On the conversational persuasiveness of GPT-4. Nature Human Behaviour, 9(8):1645–1653, May 2025. doi:10.1038/s41562-025-02194-6

[65]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, et al. Towards understanding sycophancy in language models. In The International Conference on Learning Representations, 2024

[66]

Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36:61836–61856, 2023

[67]

Judith Sieker, Simeon Junker, Ronja Utescher, Nazia Attari, Heiko Wersing, Hendrik Buschmeier, and Sina Zarrieß. The illusion of competence: Evaluating the effect of explanations on users’ mental models of visual question answering systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, November 2024. doi:10.18653...

[68]

Karan Singhal et al. Toward expert-level medical question answering with large language models. Nature Medicine, 2025

[69]

Dylan Slack, Anna Hilgard, Sameer Singh, and Himabindu Lakkaraju. Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in neural information processing systems, 2021

[70]

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025

[71]

Yuzhi Sun and David A Nembhard. The effect of highlighting on cognitive load and visual attention in multimedia learning. International Journal of Human–Computer Interaction, 2025

[72]

Christian Szegedy et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2014

[73]

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, August 2021. doi:10.18653/v1/2021.findings-acl.317

[74]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2023

[75]

Danding Wang, Wencan Zhang, and Brian Y Lim. Show or suppress? managing input uncertainty in machine learning model explanations. Artificial Intelligence, 294:103456, 2021

[76]

Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087, 2025

[77]

Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 2022

[78]

Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. Naturalprover: Grounded mathematical proof generation with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

[79]

Chi-Lan Yang, Alarith Uhde, Naomi Yamashita, and Hideaki Kuzuoka. Understanding and supporting peer review using ai-reframed positive summary. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2025

[80]

Kaiyu Yang et al. Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023


[1] [1]

Explaining individual predictions when features are dependent: More accurate approximations to shapley values.Artificial Intelligence, 2021

Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to shapley values.Artificial Intelligence, 2021

work page 2021

[2] [2]

plausibility: On the (un)reliability of explanations from large language models

Chirag Agarwal, Sree Harsha Tanneru, and Himabindu Lakkaraju. Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models.arXiv preprint arXiv:2402.04614, 2024

work page arXiv 2024

[3] [3]

Amazon Web Services,

Amazon Web Services, Inc.Amazon Mechani- cal Turk Documentation. Amazon Web Services,

work page

[4] [4]

URL: https://docs.aws.amazon.com/ AWSMechTurk/. 14

work page

[5] [5]

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679, 2025

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679, 2025

work page arXiv 2025

[6] [6]

Evaluating robustness of coun- terfactual explanations

André Artelt, Valerie Vaquet, Riza Velioglu, Fabian Hinder, Johannes Brinkrolf, Malte Schilling, and Barbara Hammer. Evaluating robustness of coun- terfactual explanations. In2021 IEEE symposium series on computational intelligence (SSCI), pages 01–09. IEEE, 2021

work page 2021

[7] [7]

Llms for explainable ai: A comprehensive survey.arXiv preprint arXiv:2504.00125, 2025

Ahsan Bilal, David Ebert, and Beiyu Lin. Llms for explainable ai: A comprehensive survey.arXiv preprint arXiv:2504.00125, 2025

work page arXiv 2025

[8] [8]

The impact of large language models on students: A randomised study of socratic vs

Andrea Blasco and Vicky Charisi. The impact of large language models on students: A randomised study of socratic vs. non-socratic ai and the role of step-by-step reasoning.Non-Socratic AI and the Role of Step-by-Step Reasoning, 2024

work page 2024

[9] [9]

The persuasive power of large language models

Simon Martin Breum et al. The persuasive power of large language models. InProceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 152–163, 2024

work page 2024

[10] [10]

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng et al. Social sycophancy: A broader understanding of llm sycophancy, 2025. arXiv: 2505.13995

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Uncertainty in xai: Human perception and modeling approaches.Machine Learning and Knowledge Extraction, 6(2), 2024

Teodor Chiaburu, Frank Haußer, and Felix Bieß- mann. Uncertainty in xai: Human perception and modeling approaches.Machine Learning and Knowledge Extraction, 6(2), 2024

work page 2024

[12] [12]

Human confidence in artificial intelligence and in themselves: The evolution and impact of confi- dence on adoption of ai advice.Computers in Human Behavior, 2022

Leah Chong, Guanglu Zhang, Kosa Goucher- Lambert, Kenneth Kotovsky, and Jonathan Cagan. Human confidence in artificial intelligence and in themselves: The evolution and impact of confi- dence on adoption of ai advice.Computers in Human Behavior, 2022

work page 2022

[13] [13]

I think i get your point, ai! the illusion of explanatory depth in explainable ai

Michael Chromik, Malin Eiband, Felicitas Buch- ner, Adrian Krüger, and Andreas Butz. I think i get your point, ai! the illusion of explanatory depth in explainable ai. InProceedings of the 26th Inter- national Conference on Intelligent User Interfaces, pages 307–317, 2021

work page 2021

[14] [14]

Faithlm: Towards faithful explanations for large language models, 2024.arXiv:2402.04678

Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, and Xia Hu. Faithlm: Towards faithful explanations for large language models, 2024.arXiv:2402.04678

work page arXiv 2024

[15] [15]

Collins New York, 2007

Robert B Cialdini and Robert B Cialdini.Influence: The psychology of persuasion, volume 55. Collins New York, 2007

work page 2007

[16] [16]

Believing anthropomor- phism: Examining the role of anthropomorphic cues on trust in large language models

Michelle Cohn et al. Believing anthropomor- phism: Examining the role of anthropomorphic cues on trust in large language models. InEx- tended Abstracts of the CHI Conference on Hu- man Factors in Computing Systems, 2024. doi: 10.1145/3613905.3650818

work page doi:10.1145/3613905.3650818 2024

[17] [17]

An interactional account of empathy in human- machine communication.Human-Machine Com- munication, 6(1):6, 2023

Shauna Concannon, Ian Roberts, and Marcus Toma- lin. An interactional account of empathy in human- machine communication.Human-Machine Com- munication, 6(1):6, 2023

work page 2023

[18] [18]

Anwesha Das, Zekun Wu, Iza Skrjanec, and Anna Maria Feit. Shifting focus with hceye: Ex- ploring the dynamics of visual highlighting and cognitive load on user attention and saliency predic- tion.Proceedings of the ACM on Human-Computer Interaction, 8(ETRA):1–18, 2024

work page 2024

[19] [19]

On generating trustworthy counterfactual explanations.Information Sciences, 2024

Javier Del Ser, Alejandro Barredo-Arrieta, Natalia Díaz-Rodríguez, Francisco Herrera, Anna Saranti, and Andreas Holzinger. On generating trustworthy counterfactual explanations.Information Sciences, 2024

work page 2024

[20] [20]

Citations and trust in llm gener- ated responses

Yifan Ding et al. Citations and trust in llm gener- ated responses. InProceedings of the AAAI Con- ference on Artificial Intelligence, 2025

work page 2025

[21] [21]

Fluid transformers and creative analogies: Exploring large language models’ ca- pacity for augmenting cross-domain analogical cre- ativity

Zijian Ding, Arvind Srinivasan, Stephen MacNeil, and Joel Chan. Fluid transformers and creative analogies: Exploring large language models’ ca- pacity for augmenting cross-domain analogical cre- ativity. InProceedings of the 15th Conference on Creativity and Cognition, pages 489–505, 2023

work page 2023

[22] [22]

Secure human oversight of ai: Exploring the attack surface of human oversight

Jonas C Ditz et al. Secure human oversight of ai: Exploring the attack surface of human oversight. arXiv preprint arXiv:2509.12290, 2025

work page arXiv 2025

[23] [23]

Human-level play in the game of diplomacy by combining language models with strategic reasoning.Science, 2022

Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning.Science, 2022

work page 2022

[24] [24]

Evidence-based xai: An empirical approach to design more effective and explainable decision support systems.Computers in biology and medicine, 170(March 2024), 2024

Lorenzo Famiglini et al. Evidence-based xai: An empirical approach to design more effective and explainable decision support systems.Computers in biology and medicine, 170(March 2024), 2024

work page 2024

[25] [25]

Posi- tion: Human factors reshape adversarial analysis in human-ai decision-making systems.arXiv preprint arXiv:2509.21436, 2025

Shutong Fan, Lan Zhang, and Xiaoyong Yuan. Posi- tion: Human factors reshape adversarial analysis in human-ai decision-making systems.arXiv preprint arXiv:2509.21436, 2025. 15

work page arXiv 2025

[26] [26]

On the creativity of large language models.AI & SOCI- ETY, pages 1–11, 2024

Giorgio Franceschelli and Mirco Musolesi. On the creativity of large language models.AI & SOCI- ETY, pages 1–11, 2024

work page 2024

[27] [27]

Model inversion attacks that exploit confi- dence information and basic countermeasures

Matt Fredrikson, Somesh Jha, and Thomas Risten- part. Model inversion attacks that exploit confi- dence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, 2015

work page 2015

[28] [28]

Learning complementary policies for human-ai teams.arXiv preprint arXiv:2302.02944, 2023

Ruijiang Gao, Maytal Saar-Tsechansky, Maria De- Arteaga, Ligong Han, Wei Sun, Min Kyung Lee, and Matthew Lease. Learning complementary policies for human-ai teams.arXiv preprint arXiv:2302.02944, 2023

work page arXiv 2023

[29] [29]

Explaining and harnessing adversarial examples

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. InInternational Conference on Learning Representations (ICLR), 2015

work page 2015

[30] [30]

A survey of methods for explaining black box models.ACM computing surveys (CSUR), 2018

Riccardo Guidotti et al. A survey of methods for explaining black box models.ACM computing surveys (CSUR), 2018

work page 2018

[31] [31]

A meta-analysis of factors affecting trust in human-robot interaction.Human factors, 2011

Peter A Hancock et al. A meta-analysis of factors affecting trust in human-robot interaction.Human factors, 2011

work page 2011

[32] [32]

Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models

Shibo Hao et al. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. InFirst Conference on Language Modeling, 2024

work page 2024

[33] [33]

Measuring massive multitask language understanding.Proceedings of the Inter- national Conference on Learning Representations (ICLR), 2021

Dan Hendrycks et al. Measuring massive multitask language understanding.Proceedings of the Inter- national Conference on Learning Representations (ICLR), 2021

work page 2021

[34] [34]

Citation: A key to building responsible and accountable large language models.arXiv preprint arXiv:2307.02185, 2023

Jie Huang and Kevin Chen-Chuan Chang. Citation: A key to building responsible and accountable large language models.arXiv preprint arXiv:2307.02185, 2023

work page arXiv 2023

[35] [35]

Towards analogy-based expla- nations in machine learning

Eyke Hüllermeier. Towards analogy-based expla- nations in machine learning. InInternational Con- ference on Modeling Decisions for Artificial Intelli- gence. Springer, 2020

work page 2020

[36] [36]

GPT-4o System Card

Aaron Hurst et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

To- wards interactive evaluations for interaction harms in human-ai systems

Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. To- wards interactive evaluations for interaction harms in human-ai systems. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025

work page 2025

[38] [38]

The effects of emotions on trust in human-computer interaction: A survey and prospect.International Journal of Human– Computer Interaction, 2024

Myounghoon Jeon. The effects of emotions on trust in human-computer interaction: A survey and prospect.International Journal of Human– Computer Interaction, 2024

work page 2024

[39] [39]

Constrained high- lighting in a document reader can improve reading comprehension

Nikhita Joshi and Daniel V ogel. Constrained high- lighting in a document reader can improve reading comprehension. InProceedings of the CHI Con- ference on Human Factors in Computing Systems, 2024

work page 2024

[40] [40]

Frames, framing and reframing.Be- yond intractability, 1:1–8, 2003

Sanda Kaufman, Michael Elliott, and Deborah Shmueli. Frames, framing and reframing.Be- yond intractability, 1:1–8, 2003

work page 2003

[41] [41]

Artificial intelligence and the ongoing need for empathy, compassion and trust in healthcare.Bulletin of the World Health Organiza- tion, 98(4):245, 2020

Angeliki Kerasidou. Artificial intelligence and the ongoing need for empathy, compassion and trust in healthcare.Bulletin of the World Health Organiza- tion, 98(4):245, 2020

work page 2020

[42] [42]

how do i fool you?

Himabindu Lakkaraju and Osbert Bastani. " how do i fool you?" manipulating user trust via misleading black box explanations. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 79–85, 2020

work page 2020

[43] [43]

Polite speech strategies and their impact on drivers’ trust in au- tonomous vehicles.Computers in Human Behavior, 127:107015, 2022

Jae-gil Lee and Kwan Min Lee. Polite speech strategies and their impact on drivers’ trust in au- tonomous vehicles.Computers in Human Behavior, 127:107015, 2022

work page 2022

[44] [44]

Trust in automation: Designing for appropriate reliance.Human factors, 46(1), 2004

John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance.Human factors, 46(1), 2004

work page 2004

[45] [45]

Towards uncertainty aware task delegation and human-ai collaborative decision-making

Min Hun Lee and Martyn Zhe Yu Tok. Towards uncertainty aware task delegation and human-ai collaborative decision-making. InProceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2025

work page 2025

[46] [46]

Retrieval-augmented genera- tion for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459– 9474, 2020

Patrick Lewis et al. Retrieval-augmented genera- tion for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459– 9474, 2020

work page 2020

[47] [47]

Vera Liao et al

Q. Vera Liao et al. Questioning the ai: Informing design practices for explainable ai user experiences. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022

work page 2022

[48] [48]

Explainable ai: A review of machine learning interpretability methods.Entropy, 23(1):18, 2020

Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods.Entropy, 23(1):18, 2020. 16

work page 2020

[49] [49]

Strategic adversarial attacks in ai- assisted decision making to reduce human trust and reliance

Zhuoran Lu, Zhuoyan Li, Chun-Wei Chiang, and Ming Yin. Strategic adversarial attacks in ai- assisted decision making to reduce human trust and reliance. InIJCAI, pages 3020–3028, 2023

work page 2023

[50] [50]

Erick Mendez Guzman, Viktor Schlegel, and Riza Batista-Navarro

Andreas Madsen, Sarath Chandar, and Siva Reddy. Are self-explanations from large language models faithful?arXiv preprint arXiv:2401.07927, 2024

work page arXiv 2024

[51] [51]

Sycophancy in large language models: Causes and mitigations

Lars Malmqvist. Sycophancy in large language models: Causes and mitigations. InIntelligent Computing-Proceedings of the Computing Confer- ence, pages 61–74. Springer, 2025

work page 2025

[52] [52]

Walk the talk? measuring the faithful- ness of large language model explanations

Katie Matton, Robert Ness, John Guttag, and Emre Kiciman. Walk the talk? measuring the faithful- ness of large language model explanations. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[53] [53]

Explanation in artificial intelligence: Insights from the social sciences.Artificial intelli- gence, 267, 2019

Tim Miller. Explanation in artificial intelligence: Insights from the social sciences.Artificial intelli- gence, 267, 2019

work page 2019

[54] [54]

The trouble with overconfidence.Psychological review, 115(2):502, 2008

Don A Moore and Paul J Healy. The trouble with overconfidence.Psychological review, 115(2):502, 2008

work page 2008

[55] [55]

Vera and Bellamy, Rachel K

Ramaravind K. Mothilal, Amit Sharma, and Chen- hao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. InPro- ceedings of the Conference on Fairness, Account- ability, and Transparency, page 607–617, 2020. doi:10.1145/3351095.3372850

work page doi:10.1145/3351095.3372850 2020

[56] [56]

Llms for science: Usage for code generation and data analysis.Journal of Software: Evolution and Process, 37(1), 2025

Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, and Ingo Weber. Llms for science: Usage for code generation and data analysis.Journal of Software: Evolution and Process, 37(1), 2025

work page 2025

[57] [57]

The elabora- tion likelihood model of persuasion

Richard E Petty and John T Cacioppo. The elabora- tion likelihood model of persuasion. InAdvances in experimental social psychology, volume 19, pages 123–205. Elsevier, 1986

work page 1986

[58] [58]

Natural example-based explainabil- ity: a survey

Antonin Poché, Lucas Hervier, and Mohamed- Chafik Bakkay. Natural example-based explainabil- ity: a survey. InWorld Conference on eXplainable Artificial Intelligence, pages 24–47. Springer, 2023

work page 2023

[59] [59]

The effect of framing on trust in artificial intelligence: An analysis of acceptance behavior.Available at SSRN 5008348, 2024

Sonja Gabriele Prinz, Barbara E Weißenberger, and Peter Kotzian. The effect of framing on trust in artificial intelligence: An analysis of acceptance behavior.Available at SSRN 5008348, 2024

work page 2024

[60] [60]

Qualtrics survey platform, 2025

Qualtrics. Qualtrics survey platform, 2025. URL: https://www.qualtrics.com/

work page 2025

[61] [61]

Towards human-centered explain- able ai: A survey of user studies for model explana- tions.IEEE transactions on pattern analysis and machine intelligence, 46(4):2104–2122, 2023

Yao Rong et al. Towards human-centered explain- able ai: A survey of user studies for model explana- tions.IEEE transactions on pattern analysis and machine intelligence, 46(4):2104–2122, 2023

work page 2023

[62] [62]

Talk, listen, connect: How humans and ai evaluate empathy in responses to emotionally charged narratives, 2025

Mahnaz Roshanaei, Rezvaneh Rezapour, and Magy Seif El-Nasr. Talk, listen, connect: How humans and ai evaluate empathy in responses to emotionally charged narratives, 2025. arXiv: 2409.15550

work page arXiv 2025

[63] [63]

A missing piece in the puzzle: Considering the role of task complexity in human-ai decision making

Sara Salimzadeh, Gaole He, and Ujwal Gadiraju. A missing piece in the puzzle: Considering the role of task complexity in human-ai decision making. In Proceedings of the 31st ACM Conference on User Modeling, Adaptation and Personalization, 2023

work page 2023

[64] [64]

On the conversational per- suasiveness of GPT-4

Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West. On the conversa- tional persuasiveness of gpt-4.Nature Human Behaviour, 9(8):1645–1653, May 2025. doi: 10.1038/s41562-025-02194-6

work page doi:10.1038/s41562-025-02194-6 2025

[65] [65]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, et al. Towards understanding sycophancy in language models. InThe Inter- national Conference on Learning Representations, 2024

work page 2024

[66] [66]

On the exploitability of instruction tuning.Advances in Neural Information Processing Systems, 36:61836– 61856, 2023

Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geip- ing, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning.Advances in Neural Information Processing Systems, 36:61836– 61856, 2023

work page 2023

[67] [67]

Wu, T., Xiang, C., Wang, J

Judith Sieker, Simeon Junker, Ronja Utescher, Nazia Attari, Heiko Wersing, Hendrik Buschmeier, and Sina Zarrieß. The illusion of competence: Evaluating the effect of explanations on users’ men- tal models of visual question answering systems. InProceedings of the Conference on Empirical Methods in Natural Language Processing, Novem- ber 2024. doi:10.18653...

work page doi:10.18653/v1/2024.emnlp-main 2024

[68] [68]

Toward expert-level medical question answering with large language models

Karan Singhal et al. Toward expert-level medical question answering with large language models. Nature Medicine, 2025

work page 2025

[69] [69]

Reliable post hoc explanations: Modeling uncertainty in explainability

Dylan Slack, Anna Hilgard, Sameer Singh, and Himabindu Lakkaraju. Reliable post hoc explanations: Modeling uncertainty in explainability. Advances in Neural Information Processing Systems, 2021

work page 2021

[70] [70]

What large language models know and what people think they know

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025

work page 2025

[71] [71]

The effect of highlighting on cognitive load and visual attention in multimedia learning

Yuzhi Sun and David A Nembhard. The effect of highlighting on cognitive load and visual attention in multimedia learning. International Journal of Human–Computer Interaction, 2025

work page 2025

[72] [72]

Intriguing properties of neural networks

Christian Szegedy et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[73] [73]

ProofWriter: Generating implications, proofs, and abductive statements over natural language

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, August 2021. doi:10.18653/v1/2021.findings-acl.317

work page doi:10.18653/v1/2021 2021

[74] [74]

Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[75] [75]

Show or suppress? Managing input uncertainty in machine learning model explanations

Danding Wang, Wencan Zhang, and Brian Y Lim. Show or suppress? Managing input uncertainty in machine learning model explanations. Artificial Intelligence, 294:103456, 2021

work page 2021

[76] [76]

When truth is overridden: Uncovering the internal origins of sycophancy in large language models

Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087, 2025

work page arXiv 2025

[77] [77]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 2022

work page 2022

[78] [78]

NaturalProver: Grounded mathematical proof generation with language models

Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. NaturalProver: Grounded mathematical proof generation with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

work page 2022

[79] [79]

Understanding and supporting peer review using AI-reframed positive summary

Chi-Lan Yang, Alarith Uhde, Naomi Yamashita, and Hideaki Kuzuoka. Understanding and supporting peer review using AI-reframed positive summary. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2025

work page 2025

[80] [80]

LeanDojo: Theorem proving with retrieval-augmented language models

Kaiyu Yang et al. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36:21573–21612, 2023

work page 2023