When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
Pith reviewed 2026-05-21 13:18 UTC · model grok-4.3
The pith
Adversarial explanation attacks preserve nearly all user trust in incorrect AI outputs by manipulating explanation framing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce adversarial explanation attacks, which manipulate the framing of LLM-generated explanations so that user trust in an incorrect output stays close to trust under a benign explanation, keeping the trust miscalibration gap small. Human studies show users report nearly identical trust for adversarial and benign explanations, preserving the vast majority of trust despite incorrect outputs. Vulnerability is highest when explanations combine authoritative evidence, neutral tone, and domain-appropriate reasoning on hard tasks in fact-driven domains.
What carries the argument
Adversarial explanation attacks that vary four dimensions of explanation framing (reasoning mode, evidence type, communication style, presentation format) to modulate human trust while keeping the incorrect prediction fixed.
If this is right
- Trust stays high for incorrect outputs when explanations closely resemble expert communication styles.
- Vulnerability to these attacks rises on hard tasks and in fact-driven domains.
- Users with less formal education, younger age, or higher initial trust in AI show greater susceptibility.
- The combination of authoritative evidence, neutral tone, and appropriate reasoning maximizes trust preservation.
Where Pith is reading between the lines
- AI systems could incorporate checks that flag explanations with unusually persuasive framing patterns and prompt users to review the raw prediction.
- Training users to recognize shifts in evidence type or tone might reduce the effectiveness of such attacks in real decision settings.
- The same framing manipulations could influence trust in other automated decision tools that generate natural-language justifications.
Load-bearing premise
The four dimensions of explanation framing can be systematically varied in a controlled way that isolates their effect on trust without confounding factors from task content or participant expectations.
What would settle it
A replication study using the same tasks and participant pool that measures trust ratings and finds a drop of more than twenty percent in trust for adversarial explanations compared to benign ones would falsify the preservation claim.
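The falsification criterion above can be made concrete with a small sketch. It assumes trust is recorded as numeric ratings and reads "a drop of more than twenty percent" as a relative drop in mean trust from the benign to the adversarial condition. The function names, the 1-7 scale, and the sample ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_trust_drop(benign, adversarial):
    """Fraction of mean benign trust that is lost under adversarial explanations."""
    benign_mean = mean(benign)
    if benign_mean == 0:
        raise ValueError("mean benign trust must be non-zero")
    return (benign_mean - mean(adversarial)) / benign_mean


def falsifies_preservation(benign, adversarial, threshold=0.20):
    """True when the relative drop exceeds the threshold (20% by default)."""
    return relative_trust_drop(benign, adversarial) > threshold


# Made-up ratings on a hypothetical 1-7 scale, for illustration only.
benign_ratings = [6, 5, 6, 7, 5]
adversarial_ratings = [6, 5, 5, 7, 5]

drop = relative_trust_drop(benign_ratings, adversarial_ratings)
print(f"relative drop: {drop:.3f}")
print("falsifies preservation claim:",
      falsifies_preservation(benign_ratings, adversarial_ratings))
```

A real replication would also need an inferential test on the difference; a point estimate of the drop alone would not settle the claim.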
Original abstract
Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces adversarial explanation attacks (AEAs), in which the framing of LLM-generated explanations is manipulated to preserve human trust in incorrect AI predictions. It defines a trust miscalibration gap metric and reports a human-subject study with over 200 participants that systematically varies four framing dimensions (reasoning mode, evidence type, communication style, presentation format). The central empirical claim is that participants report nearly identical trust levels for adversarial and benign explanations, with adversarial framings preserving the vast majority of benign trust. Vulnerability is reported to be highest for expert-like framings, hard tasks, fact-driven domains, and among less-educated, younger, or highly AI-trusting participants.
Significance. If the reported trust-preservation effect is robust, the work identifies a previously under-examined cognitive-layer attack surface in human-AI decision loops. The empirical mapping of framing dimensions to trust miscalibration supplies concrete evidence that persuasive but incorrect explanations can undermine appropriate reliance, with direct implications for explanation design, user-interface safeguards, and regulatory guidance on AI transparency.
major comments (2)
- Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.
- Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.
minor comments (2)
- The term 'trust miscalibration gap' is introduced without a formal equation or precise operationalization in the abstract; a short definitional paragraph or equation would improve clarity.
- The abstract states 'over 200 participants' but does not specify the exact N, exclusion criteria, or power analysis; adding these numbers in the methods section would strengthen reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help strengthen the clarity and rigor of our human-subject study. We address each major comment below and will incorporate revisions to provide the requested methodological and analytical details.
- Referee: Human study description (abstract and §4): the central claim that adversarial explanations preserve nearly all benign trust rests on the assertion that the four framing dimensions were varied while holding task content constant and neutralizing participant expectations. The manuscript provides no information on randomization procedures, pre-measures of expectations, balancing of task difficulty across conditions, or exact task domains, leaving open the possibility that observed effects are driven by content confounds rather than framing.
  Authors: We acknowledge that these procedural details were not sufficiently elaborated in the submitted manuscript. The study was designed with task content held constant across conditions (only framing varied), using a within-subjects Latin-square randomization of the four framing dimensions, a pre-experiment questionnaire to assess and neutralize baseline AI expectations, and pilot-tested tasks balanced for difficulty. Exact domains included medical diagnosis and financial forecasting scenarios. In the revised version we will add a dedicated subsection in §4 with this full protocol description to rule out content confounds.
  revision: yes
- Referee: Human study analysis (abstract and §5): no statistical tests, effect sizes, confidence intervals, or corrections for multiple comparisons are reported despite the multi-dimensional design and demographic subgroup claims. Without these details it is impossible to assess whether the 'nearly identical trust' finding is statistically supported or whether the reported demographic and task-difficulty moderators survive appropriate controls.
  Authors: We agree that inferential statistics are necessary for rigorous interpretation. The original submission prioritized descriptive reporting of the trust-preservation effect; we will revise §5 to include paired t-tests (or mixed ANOVA) comparing trust scores, Cohen's d effect sizes, 95% confidence intervals, and Bonferroni corrections for the four framing dimensions plus demographic moderators. We will also add linear regression models controlling for task difficulty and participant covariates to validate the subgroup findings.
  revision: yes
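Two elements of the rebuttal, Latin-square ordering and a paired comparison with an effect size and a Bonferroni threshold, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' protocol: the rebuttal does not say how the Latin square is built, so a simple cyclic square is assumed here; the effect size is the paired variant of Cohen's d (often written d_z); and the function names and sample ratings are hypothetical.

```python
import math
from statistics import mean, stdev


def latin_square(conditions):
    """Cyclic Latin square: row i is the condition list rotated left by i.

    Each condition appears once per row and once per column. A cyclic square
    does not balance carry-over effects between adjacent conditions.
    """
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]


def paired_summary(benign, adversarial, n_comparisons=1, alpha=0.05):
    """Paired t statistic, Cohen's d_z, and a Bonferroni-adjusted alpha."""
    if len(benign) != len(adversarial) or len(benign) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(adversarial, benign)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    d_z = mean(diffs) / sd
    return {
        "t": d_z * math.sqrt(len(diffs)),
        "df": len(diffs) - 1,
        "cohens_dz": d_z,
        "alpha_adjusted": alpha / n_comparisons,
    }


# Orders for the four framing dimensions named in the paper.
orders = latin_square(["reasoning mode", "evidence type",
                       "communication style", "presentation format"])
for row in orders:
    print(row)

# Made-up paired ratings, for illustration only.
print(paired_summary([5, 6, 7, 6], [4, 6, 6, 5], n_comparisons=4))
```

The p-value and the 95% confidence interval are left out because they need the t distribution, which the standard library does not provide; a statistics package would supply them in practice.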
Circularity Check
No significant circularity in empirical human-subject study
The paper is an empirical human-subject study that defines the trust miscalibration gap as a metric for the difference in reported trust between benign and adversarial explanations, then reports experimental results from over 200 participants across four framing dimensions. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The central findings rest on direct participant data rather than any reduction of outputs to inputs by construction, self-definition, or imported uniqueness theorems. The work is therefore self-contained as an observational study with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-reported trust scales in a controlled online study accurately reflect real-world decision reliance on AI outputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We define the trust miscalibration gap as the change in user trust induced by adversarial explanation relative to the benign condition: ΔT(q,s) = E_u[T(u,q,e_A(q,s))] − E_u[T(u,q,e_B(q))].
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format
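The definition quoted in the first entry above, ΔT(q,s) = E_u[T(u,q,e_A(q,s))] − E_u[T(u,q,e_B(q))], can be estimated from study data by replacing each expectation over users with a sample mean. The sketch below is a minimal illustration: the record layout `(user, question, condition, rating)`, the condition labels, and the sample values are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean


def trust_gap(records, question, strategy):
    """Estimate ΔT(q, s): mean adversarial trust minus mean benign trust.

    records: iterable of (user, question, condition, rating) tuples, where
    condition is "benign" or an adversarial strategy label (assumed schema).
    """
    by_condition = defaultdict(list)
    for _user, q, condition, rating in records:
        if q == question:
            by_condition[condition].append(rating)
    if not by_condition[strategy] or not by_condition["benign"]:
        raise ValueError("need ratings for both the strategy and the benign condition")
    return mean(by_condition[strategy]) - mean(by_condition["benign"])


# Made-up ratings, for illustration only.
records = [
    ("u1", "q1", "benign", 6),
    ("u2", "q1", "benign", 5),
    ("u1", "q1", "authoritative", 6),
    ("u2", "q1", "authoritative", 4),
]
print(trust_gap(records, "q1", "authoritative"))
```

Under this sign convention a gap near zero means the adversarial explanation preserved trust, and a clearly negative gap means users trusted the incorrect output less than the benign one.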
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems
  LLM chat systems show large differences in reference quantity and quality, but users rarely click or engage with them.