arxiv: 2509.08010 · v2 · pith:ZI7I3C2Jnew · submitted 2025-09-08 · 💻 cs.CY · cs.AI · cs.CL · cs.HC

Measuring and mitigating overreliance to build human-compatible AI

Pith reviewed 2026-05-21 22:33 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.HC
keywords overreliance · large language models · human-AI collaboration · cognitive biases · AI safety · measurement methods · mitigation strategies

The pith

Measuring and mitigating overreliance must become central to LLM research and deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models function as collaborative thought partners that engage fluidly in natural language on a range of tasks. This sets them apart from earlier technologies and raises the risk of overreliance, where people depend on the models beyond their actual capabilities. The paper consolidates individual and societal risks including high-stakes errors, governance challenges, and cognitive deskilling. It reviews historical measurement approaches, identifies three gaps, and proposes new directions along with mitigation strategies to ensure LLMs augment rather than undermine human capabilities.

Core claim

Large language models distinguish themselves from previous technologies by functioning as collaborative thought partners capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment because LLM characteristics, system design features, and user cognitive biases together raise serious and unique concerns about overreliance in practice.

What carries the argument

The concept of overreliance, defined as relying on LLMs beyond their capabilities, which the paper attributes to the combined effects of LLMs' behavior as fluid natural-language thought partners, system design features, and user cognitive biases.

Load-bearing premise

The premise that LLM characteristics, system design features, and user cognitive biases together raise serious and unique concerns about overreliance that prior technologies did not.

What would settle it

A controlled study finding comparable rates of overreliance and comparable downstream harms when users interact with LLMs versus earlier technologies such as web search tools or rule-based decision aids on matched tasks would undermine the claim of unique concerns.
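As a rough illustration of how such a comparison might be scored, the sketch below assumes one common operationalization from the human-AI decision-making literature: overreliance is the share of trials with incorrect advice in which the participant followed the advice anyway. The `Trial` record, the function names, and the pooled two-proportion z-test are assumptions made for this example; the paper does not prescribe this metric or this test.

```python
from dataclasses import dataclass
from math import erf, sqrt


@dataclass(frozen=True)
class Trial:
    """One participant decision in a hypothetical matched-task study."""
    advice_correct: bool   # was the tool's suggestion right?
    followed_advice: bool  # did the participant adopt the suggestion?


def overreliance_rate(trials):
    """Share of incorrect-advice trials in which the advice was still followed."""
    wrong = [t for t in trials if not t.advice_correct]
    if not wrong:
        raise ValueError("need at least one trial with incorrect advice")
    return sum(t.followed_advice for t in wrong) / len(wrong)


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p1 - p2) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


if __name__ == "__main__":
    # Toy records for illustration only; these are not data from the paper.
    llm_condition = [Trial(False, True), Trial(False, True), Trial(False, False)]
    search_condition = [Trial(False, True), Trial(False, False), Trial(False, False)]
    r_llm = overreliance_rate(llm_condition)
    r_search = overreliance_rate(search_condition)
    print(r_llm, r_search, two_proportion_z(r_llm, 3, r_search, 3))
```

Under these assumptions, similar rates across the two conditions (a z statistic near zero) would point toward the outcome described above, although a real study would also need to compare downstream harms and would use far larger samples than this toy input.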

Figures

Figures reproduced from arXiv: 2509.08010 by Alia El Kattan, Andrew Strait, Anka Reuel, Diyi Yang, Ilia Sucholutsky, Katherine M. Collins, Kevin Feng, Lama Ahmad, Lujain Ibrahim, Max Lamparth, Merlin Stein, Prajna Soni, Q. Vera Liao, Siddharth Swaroop, Sunnie S. Y. Kim, Umang Bhatt, Vishakh Padmakumar.

Figure 1: Summary of LLM characteristics, measurement challenges, and promising directions for …
Original abstract

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative "thought partners," capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance (relying on LLMs beyond their capabilities) grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that overreliance on LLMs (relying on them beyond their capabilities) poses serious individual and societal risks (high-stakes errors, governance challenges, cognitive deskilling) that are qualitatively distinct from those of prior technologies because of LLMs' fluid natural-language collaboration. It consolidates these risks, attributes them to LLM characteristics, system design features, and user cognitive biases, reviews historical measurement approaches to identify three gaps, proposes three new measurement directions, and outlines mitigation strategies to ensure LLMs augment rather than undermine human capabilities.

Significance. If the uniqueness premise holds and the proposed measurement directions can be operationalized, this position paper could usefully shift priorities in human-AI interaction and AI alignment research toward systematic evaluation of reliance behaviors. The consolidation of risks across domains and explicit identification of measurement gaps provide a clear agenda for subsequent empirical studies.

major comments (2)
  1. [Section 3] Section 3: The assertion that LLM traits, design features, and cognitive biases 'raise serious and unique concerns about overreliance' (abstract paragraph 2) rests on illustrative examples rather than comparative incidence rates, error-severity metrics, or longitudinal deskilling data against baselines such as rule-based expert systems, web search, or GPS navigation. This gap directly undermines the load-bearing premise that these issues warrant elevating measurement and mitigation to a central research priority.
  2. [Section 2] Section 2: The consolidation of individual and societal risks is logically structured but lacks quantitative contrasts (e.g., error rates or deskilling trajectories) with historical technologies, leaving the claim that LLM overreliance introduces qualitatively new governance and capability-undermining challenges without sufficient empirical anchoring for the central argument.
minor comments (1)
  1. [Abstract] The abstract would benefit from briefly enumerating the three proposed measurement directions and the main mitigation strategies to improve reader orientation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We appreciate the recognition that the paper could usefully shift priorities in human-AI interaction research if the uniqueness premise holds and the proposed directions are operationalized. As a position paper, our goal is to consolidate risks, identify measurement gaps, and outline an agenda rather than deliver new comparative empirical data. We address the major comments point by point below and have revised the manuscript to clarify the scope and nature of our arguments.

Point-by-point responses
  1. Referee: [Section 3] Section 3: The assertion that LLM traits, design features, and cognitive biases 'raise serious and unique concerns about overreliance' (abstract paragraph 2) rests on illustrative examples rather than comparative incidence rates, error-severity metrics, or longitudinal deskilling data against baselines such as rule-based expert systems, web search, or GPS navigation. This gap directly undermines the load-bearing premise that these issues warrant elevating measurement and mitigation to a central research priority.

    Authors: We agree that the paper relies on illustrative examples and qualitative distinctions rather than new comparative quantitative data. The central claim for uniqueness rests on LLMs' capacity for fluid, open-ended natural-language collaboration, which enables forms of interaction and potential cognitive integration not present in rule-based systems, search engines, or navigation tools. This distinction is drawn from existing human-AI interaction literature rather than asserted as proven by new metrics. We acknowledge the absence of direct incidence-rate or longitudinal comparisons as a limitation of the current evidence base. In the revised manuscript we have added explicit language in Section 3 stating that the uniqueness argument is a hypothesis to be tested through the measurement directions we propose, rather than a claim supported by new comparative data. This is a partial revision that clarifies scope without changing the position paper's core contribution.

    Revision: partial

  2. Referee: [Section 2] Section 2: The consolidation of individual and societal risks is logically structured but lacks quantitative contrasts (e.g., error rates or deskilling trajectories) with historical technologies, leaving the claim that LLM overreliance introduces qualitatively new governance and capability-undermining challenges without sufficient empirical anchoring for the central argument.

    Authors: We accept that the risk consolidation would be strengthened by quantitative contrasts with prior technologies. The paper synthesizes risks reported across domains and attributes them to LLM-specific characteristics, but does not conduct or cite new comparative error-rate or deskilling analyses. Such direct contrasts remain limited in the literature precisely because LLMs are recent; this scarcity is one of the measurement gaps the paper identifies. We have revised Section 2 to include a short discussion noting the difficulty of apples-to-apples comparisons and explaining how the three proposed measurement directions are intended to generate the empirical anchors needed for future governance and deskilling studies. This partial revision improves anchoring while preserving the paper's focus on agenda-setting.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; position paper relies on external citations and observations

full rationale

The paper is a position and review piece that consolidates individual/societal risks, attributes concerns to LLM traits and design features via illustrative examples, reviews historical measurement approaches from prior literature, identifies gaps, and proposes mitigation strategies. No equations, fitted parameters, self-definitional constructs, or predictions appear in the provided abstract or described structure. The central argument draws on cited historical methods and domain observations rather than reducing to self-referential inputs or self-citation chains by construction. This is a standard self-contained review format with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on domain assumptions about LLM capabilities and user behavior drawn from the abstract; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: LLMs function as collaborative thought partners capable of engaging fluidly in natural language on a range of tasks
    Stated in the first sentence of the abstract as the distinguishing feature of LLMs.
  • domain assumption: Overreliance on LLMs creates distinct individual and societal risks not fully addressed by prior technologies
    Invoked when the abstract consolidates risks and states that LLM characteristics raise unique concerns.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The efficiency-gain illusion: People underestimate the rate of AI use and overestimate its benefits on simple tasks

    cs.CY 2026-05 accept novelty 6.0

    Three pre-registered studies with 2691 participants show people underestimate their AI usage rate and overestimate efficiency gains on simple tasks, with prior use entrenching further adoption.

  2. Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows

    cs.CY 2026-04 unverdicted novelty 5.0

    Recruiters perceive themselves as retaining agency over GenAI in hiring pipelines, yet GenAI invisibly architects core evaluation inputs, producing only marginal efficiency gains at the cost of deskilling.

Reference graph

Works this paper leans on

136 extracted references · 136 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    Mirages: On anthropomorphism in dialogue systems.arXiv preprint arXiv:2305.09800, 2023

    Gavin Abercrombie, Amanda Cercas Curry, Tanvi Dinkar, Verena Rieser, and Zeerak Talat. Mirages: On anthropomorphism in dialogue systems.arXiv preprint arXiv:2305.09800, 2023

  2. [2]

    Incident 838: Microsoft Copilot Allegedly Provides Unsafe Medical Advice with High Risk of Severe Harm — incidentdatabase.ai

    AIID. Incident 838: Microsoft Copilot Allegedly Provides Unsafe Medical Advice with High Risk of Severe Harm — incidentdatabase.ai. https://incidentdatabase.ai/cite/838/,

  3. [3]

    [Accessed 09-05-2025]

  4. [4]

    Allen, C.I

    J.E. Allen, C.I. Guinn, and E. Horvtz. Mixed-initiative interaction.IEEE Intelligent Systems and their Applications, 14(5):14–23, 1999

  5. [5]

    Guidelines for human-AI interaction

    Saleema Amershi, Dan Weld, Mihaela V orvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. Guidelines for human-AI interaction. InProceedings of the 2019 chi conference on human factors in computing systems, pages 1–13, 2019

  6. [6]

    How AI ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024

    Joshua Ashkinaze, Julia Mendelsohn, Qiwei Li, Ceren Budak, and Eric Gilbert. How AI ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment.arXiv preprint arXiv:2401.13481, 2024

  7. [7]

    Artificial intelligence and machine learning in finance: Key concepts, ap- plications, and regulatory considerations

    Alessio Azzutti. Artificial intelligence and machine learning in finance: Key concepts, ap- plications, and regulatory considerations. InThe Emerald Handbook of Fintech: Reshaping Finance, pages 315–339. Emerald Publishing Limited, 2024

  8. [8]

    An empirical exploration of trust dynamics in llm supply chains.arXiv preprint arXiv:2405.16310, 2024

    Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju. An empirical exploration of trust dynamics in llm supply chains.arXiv preprint arXiv:2405.16310, 2024

  9. [9]

    Algorithm overdependence: How the use of algorithmic recommendation systems can increase risks to consumer well-being.Journal of Public Policy & Marketing, 38(4):500–515, 2019

    Sachin Banker and Salil Khetani. Algorithm overdependence: How the use of algorithmic recommendation systems can increase risks to consumer well-being.Journal of Public Policy & Marketing, 38(4):500–515, 2019

  10. [10]

    Feedbacklogs: Recording and incorporating stakeholder feedback into machine learning pipelines

    Matthew Barker, Emma Kallina, Dhananjay Ashok, Katherine Collins, Ashley Casovan, Adrian Weller, Ameet Talwalkar, Valerie Chen, and Umang Bhatt. Feedbacklogs: Recording and incorporating stakeholder feedback into machine learning pipelines. InProceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–15, 2023

  11. [11]

    On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

  12. [12]

    Learning personalized decision support policies

    Umang Bhatt, Valerie Chen, Katherine M Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, and Ameet Talwalkar. Learning personalized decision support policies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14203–14211, 2025

  13. [13]

    When should algorithms resign? a proposal for AI gover- nance.Computer, 57(10):99–103, 2024

    Umang Bhatt and Holli Sargeant. When should algorithms resign? a proposal for AI gover- nance.Computer, 57(10):99–103, 2024. 10

  14. [14]

    To rely or not to rely? evaluating interven- tions for appropriate reliance on large language models.arXiv preprint arXiv:2412.15584, 2024

    Jessica Y Bo, Sophia Wan, and Ashton Anderson. To rely or not to rely? evaluating interven- tions for appropriate reliance on large language models.arXiv preprint arXiv:2412.15584, 2024

  15. [15]

    Silvia Bonaccio and Reeshad S. Dalal. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences.Organizational Behavior and Human Decision Processes, 101(2):127–151, 2006

  16. [16]

    We need an interventionist mindset, Mar 2025

    danah boyd. We need an interventionist mindset, Mar 2025

  17. [17]

    Machine culture.Nature Human Behaviour, 7(11):1855–1868, 2023

    Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex, Thomas F Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi, Thomas L Griffiths, Joseph Henrich, et al. Machine culture.Nature Human Behaviour, 7(11):1855–1868, 2023

  18. [18]

    Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making.Proceedings of the ACM on Human-computer Interaction, 5(CSCW1):1–21, 2021

  19. [19]

    The need for cognition.Journal of personality and social psychology, 42(1):116, 1982

    John T Cacioppo and Richard E Petty. The need for cognition.Journal of personality and social psychology, 42(1):116, 1982

  20. [20]

    Understanding user reliance on AI in assisted decision- making.Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2):1–23, 2022

    Shiye Cao and Chien-Ming Huang. Understanding user reliance on AI in assisted decision- making.Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2):1–23, 2022

  21. [21]

    Pitfalls of evidence-based AI policy.arXiv preprint arXiv:2502.09618, 2025

    Stephen Casper, David Krueger, and Dylan Hadfield-Menell. Pitfalls of evidence-based AI policy.arXiv preprint arXiv:2502.09618, 2025

  22. [22]

    Kevin Castel

    P. Kevin Castel. Mata v. avianca, inc. United States District Court, Southern District of New York, June 2023. No. 1:2022cv01461, Document 54 (S.D.N.Y . 2023)

  23. [23]

    Harms from increasingly agentic algorithmic systems

    Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, Micah Carroll, et al. Harms from increasingly agentic algorithmic systems. InProceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 651–666, 2023

  24. [24]

    Probabilistic biases meet the bayesian brain.Current Directions in Psychological Science, 29(5):506–512, 2020

    Nick Chater, Jian-Qiao Zhu, Jake Spicer, Joakim Sundh, Pablo León-Villagrá, and Adam Sanborn. Probabilistic biases meet the bayesian brain.Current Directions in Psychological Science, 29(5):506–512, 2020

  25. [25]

    Random House, 2025

    Kyle Chayka.Filterworld: How algorithms flattened culture. Random House, 2025

  26. [26]

    Allison Chen, Sunnie S. Y . Kim, Amaya Dharmasiri, Olga Russakovsky, and Judith E. Fan. Portraying large language models as machines, tools, or companions affects what mental capacities humans attribute to them. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, New York, NY , USA,

  27. [27]

    Association for Computing Machinery

  28. [28]

    Understanding the role of human intuition on reliance in human-AI decision-making with explanations

    Valerie Chen, Q Vera Liao, Jennifer Wortman Vaughan, and Gagan Bansal. Understanding the role of human intuition on reliance in human-AI decision-making with explanations. Proceedings of the ACM on Human-computer Interaction, 7(CSCW2):1–32, 2023

  29. [29]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Social sycophancy: A broader understanding of llm sycophancy.arXiv preprint arXiv:2505.13995, 2025

  30. [30]

    How individual traits and language styles shape preferences in open-ended user-llm interaction: A preliminary study

    Rendi Chevi, Kentaro Inui, Thamar Solorio, and Alham Fikri Aji. How individual traits and language styles shape preferences in open-ended user-llm interaction: A preliminary study. arXiv preprint arXiv:2504.17083, 2025

  31. [31]

    Avishek Choudhury and Zaira Chaudhry. Large language models and user trust: consequence of self-referential learning loop and the deskilling of health care professionals.Journal of Medical Internet Research, 26:e56764, 2024

  32. [32]

    arXiv preprint arXiv:2501.10476 (2025)

    Katherine M Collins, Umang Bhatt, and Ilia Sucholutsky. Revisiting rogers’ paradox in the context of human-AI interaction.arXiv preprint arXiv:2501.10476, 2025. 11

  33. [33]

    arXiv preprint arXiv:2407.12804 (2024)

    Katherine M Collins, Valerie Chen, Ilia Sucholutsky, Hannah Rose Kirk, Malak Sadek, Holli Sargeant, Ameet Talwalkar, Adrian Weller, and Umang Bhatt. Modulating language model experiences through frictions.arXiv preprint arXiv:2407.12804, 2024

  34. [34]

    Building machines that learn and think with people.Nature human behaviour, 8(10):1851–1863, 2024

    Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people.Nature human behaviour, 8(10):1851–1863, 2024

  35. [35]

    Survival of the best fit.USA

    Gabor Csapo, Jihyun Kim, Miha Klasinc, and Alia ElKattan. Survival of the best fit.USA. https://www. survivalofthebestfit. com, 2019

  36. [36]

    Can Democracy Survive the Disruptive Power of AI? — carnegieendowment.org

    Raluca Csernatoni. Can Democracy Survive the Disruptive Power of AI? — carnegieendowment.org. https://carnegieendowment.org/research/2024/12/ can-democracy-survive-the-disruptive-power-of-ai?lang=en , 2024. [Accessed 09-05-2025]

  37. [37]

    AI and procurement.Manufacturing & Service Operations Management, 24(2):691–706, 2022

    Ruomeng Cui, Meng Li, and Shichen Zhang. AI and procurement.Manufacturing & Service Operations Management, 24(2):691–706, 2022

  38. [38]

    Automation and accountability in decision support system interface design

    Mary L Cummings. Automation and accountability in decision support system interface design. 2006

  39. [39]

    Mixed-initiative creative interfaces

    Sebastian Deterding, Jonathan Hook, Rebecca Fiebrink, Marco Gillies, Jeremy Gow, Memo Akten, Gillian Smith, Antonios Liapis, and Kate Compton. Mixed-initiative creative interfaces. InProceedings of the 2017 CHI conference extended abstracts on human factors in computing systems, pages 628–635, 2017

  40. [40]

    Multicalibration for confidence scoring in llms.arXiv preprint arXiv:2404.04689, 2024

    Gianluca Detommaso, Martin Bertran, Riccardo Fogliato, and Aaron Roth. Multicalibration for confidence scoring in llms.arXiv preprint arXiv:2404.04689, 2024

  41. [41]

    Algorithm aversion: people erroneously avoid algorithms after seeing them err.Journal of experimental psychology: General, 144(1):114, 2015

    Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: people erroneously avoid algorithms after seeing them err.Journal of experimental psychology: General, 144(1):114, 2015

  42. [42]

    Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them.Management Science, 64(3):1155–1170, 2018

    Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them.Management Science, 64(3):1155–1170, 2018

  43. [43]

    The role of trust in automation reliance.International journal of human-computer studies, 58(6):697–718, 2003

    Mary T Dzindolet, Scott A Peterson, Regina A Pomranky, Linda G Pierce, and Hall P Beck. The role of trust in automation reliance.International journal of human-computer studies, 58(6):697–718, 2003

  44. [44]

    Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025

    Brian D Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, et al. Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025

  45. [45]

    How AI and Human Behaviors Shape Psychosocial Effects of Extended Chatbot Use: A Longitudinal Randomized Controlled Study

    Cathy Mengying Fang, Auren R Liu, Valdemar Danry, Eunhae Lee, Samantha WT Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe, Lama Ahmad, et al. How AI and human behaviors shape psychosocial effects of chatbot use: A longitudinal randomized controlled study.arXiv preprint arXiv:2503.17473, 2025

  46. [46]

    Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S Weld, Amy X Zhang, and Joseph Chee Chang

    K.J. Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S Weld, Amy X Zhang, and Joseph Chee Chang. Cocoa: Co-planning and co-execution with AI agents.arXiv preprint arXiv:2412.10999, 2024

  47. [47]

    The human factor of AI: Implications for critical thinking and societal anxieties.TECHNOLOGY AND SOCIETY: Boon or Bane?, page 8, 2025

    Michael Gerlich. The human factor of AI: Implications for critical thinking and societal anxieties.TECHNOLOGY AND SOCIETY: Boon or Bane?, page 8, 2025

  48. [48]

    Human trust in artificial intelligence: Review of empirical research.Academy of management annals, 14(2):627–660, 2020

    Ella Glikson and Anita Williams Woolley. Human trust in artificial intelligence: Review of empirical research.Academy of management annals, 14(2):627–660, 2020

  49. [49]

    Paul Grice

    H. Paul Grice. Logic and conversation. In Donald Davidson, editor,The logic of grammar, pages 64–75. Dickenson Pub. Co., 1975. 12

  50. [50]

    Griffiths

    Thomas L. Griffiths. Understanding human intelligence through human limitations.Trends in Cognitive Sciences, 24(11):873–883, 2020

  51. [51]

    MIT Press, 2024

    Thomas L Griffiths, Nick Chater, and Joshua B Tenenbaum.Bayesian models of cognition: reverse engineering the mind. MIT Press, 2024

  52. [52]

    A decision theoretic frame- work for measuring AI reliance

    Ziyang Guo, Yifan Wu, Jason D Hartline, and Jessica Hullman. A decision theoretic frame- work for measuring AI reliance. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 221–236, 2024

  53. [53]

    Taking advice: Accepting help, improving judgment, and sharing responsibility.Organizational Behavior and Human Decision Processes, 70(2):117– 133, 1997

    Nigel Harvey and Ilan Fischer. Taking advice: Accepting help, improving judgment, and sharing responsibility.Organizational Behavior and Human Decision Processes, 70(2):117– 133, 1997

  54. [54]

    Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, New York, NY , USA, 2025. Association for Computing Machinery

  55. [55]

    Knowing about knowing: An illusion of human competence can hinder appropriate reliance on AI systems

    Gaole He, Lucie Kuiper, and Ujwal Gadiraju. Knowing about knowing: An illusion of human competence can hinder appropriate reliance on AI systems. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY , USA, 2023. Association for Computing Machinery

  56. [56]

    Trust in automation: Integrating empirical evidence on factors that influence trust.Human factors, 57(3):407–434, 2015

    Kevin Anthony Hoff and Masooda Bashir. Trust in automation: Integrating empirical evidence on factors that influence trust.Human factors, 57(3):407–434, 2015

  57. [57]

    Principles of mixed-initiative user interfaces

    Eric Horvitz. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’99, page 159–166, New York, NY , USA, 1999. Association for Computing Machinery

  58. [58]

    Yoyo Tsung-Yu Hou and Malte F Jung. Who is the expert? reconciling algorithm aversion and algorithm appreciation in AI-supported decision making.Proceedings of the ACM on Human-Computer Interaction, 5(CSCW2):1–25, 2021

  59. [59]

    Position: We need an adaptive interpretation of helpful, honest, and harmless principles.arXiv preprint arXiv:2502.06059, 2025

    Yue Huang, Chujie Gao, Yujun Zhou, Kehan Guo, Xiangqi Wang, Or Cohen-Sasson, Max Lamparth, and Xiangliang Zhang. Position: We need an adaptive interpretation of helpful, honest, and harmless principles.arXiv preprint arXiv:2502.06059, 2025

  60. [60]

    Decision theoretic foundations for experiments evaluating human decisions

    Jessica Hullman, Alex Kale, and Jason Hartline. Decision theoretic foundations for experiments evaluating human decisions. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2025

  61. [61]

    Monitoring human dependence on AI systems with reliance drills.arXiv preprint arXiv:2409.14055, 2024

    Rosco Hunter, Richard Moulange, Jamie Bernardi, and Merlin Stein. Monitoring human dependence on AI systems with reliance drills.arXiv preprint arXiv:2409.14055, 2024

  62. [62]

    Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R McKee, Verena Rieser, Murray Shanahan, and Laura Weidinger. Multi-turn evaluation of anthropomorphic behaviours in large language models. arXiv preprint arXiv:2502.07077, 2025

  63. [63]

    Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. Training language models to be warm and empathetic makes them less reliable and more sycophantic. arXiv preprint arXiv:2507.21919, 2025

  64. [64]

    Lujain Ibrahim, Saffron Huang, Umang Bhatt, Lama Ahmad, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-AI systems. arXiv preprint arXiv:2405.10632, 2024

  65. [65]

    Patricia K. Kahr, Gerrit Rooks, Chris Snijders, and Martijn C. Willemsen. The trust recovery journey. The effect of timing of errors on the willingness to follow AI advice. In Proceedings of the 29th International Conference on Intelligent User Interfaces, IUI ’24, page 609–622, New York, NY, USA, 2024. Association for Computing Machinery

  66. [66]

    Markelle Kelly, Aakriti Kumar, Padhraic Smyth, and Mark Steyvers. Capturing humans’ mental models of AI: An item response theory approach. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1723–1734, 2023

  67. [67]

    Sunnie S. Y. Kim, Q Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. "I’m Not Sure, But...": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 822–835, 2024

  68. [68]

    Sunnie S. Y. Kim, Jennifer Wortman Vaughan, Q. Vera Liao, Tania Lombrozo, and Olga Russakovsky. Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies. In ACM Conference on Human Factors in Computing Systems (CHI), 2025

  69. [69]

    Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. Humans, AI, and Context: Understanding End-Users’ Trust in a Real-World Computer Vision Application. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, page 77–88, New York, NY, USA, 2023. Association for C...

  70. [70]

    Jon Kleinberg and Manish Raghavan. Algorithmic monoculture and social welfare. Proceedings of the National Academy of Sciences, 118(22):e2018340118, 2021

  71. [71]

    Olya Kudina and Bas de Boer. Large language models, politics, and the functionalization of language. AI and Ethics, pages 1–13, 2024

  72. [72]

    Vivian Lai, Chacha Chen, Alison Smith-Renner, Q Vera Liao, and Chenhao Tan. Towards a science of human-AI decision making: An overview of design space in empirical human-subject studies. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1369–1385, 2023

  73. [73]

    Vivian Lai, Yiming Zhang, Chacha Chen, Q Vera Liao, and Chenhao Tan. Selective explanations: Leveraging human input to align explainable AI. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW2):1–35, 2023

  74. [74]

    Hao-Ping (Hank) Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The impact of generative AI on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25...

  75. [75]

    John D Lee and Neville Moray. Trust, control strategies and allocation of function in human-machine systems. Ergonomics, 35(10):1243–1270, 1992

  76. [76]

    John D Lee and Neville Moray. Trust, self-confidence, and operators’ adaptation to automation. International journal of human-computer studies, 40(1):153–184, 1994

  77. [77]

    John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human factors, 46(1):50–80, 2004

  78. [78]

    Falk Lieder and Thomas L. Griffiths. Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43:e1, 2020

  79. [79]

    Ryan Liu, Jiayi Geng, Joshua C Peterson, Ilia Sucholutsky, and Thomas L Griffiths. Large language models assume people are more rational than we really are. arXiv preprint arXiv:2406.17055, 2024

  80. [80]

    Jennifer M. Logg, Julia A. Minson, and Don A. Moore. Algorithm appreciation: People prefer algorithmic to human judgment. Organizational Behavior and Human Decision Processes, 151:90–103, 2019

Showing first 80 references.