Beliefs and Misconceptions around Integrated Conversational AI
Pith reviewed 2026-06-30 20:11 UTC · model grok-4.3
The pith
Citations in integrated conversational AI raise perceived trustworthiness without prompting users to verify the sources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Participants relied on a combination of existing perceptions of LLMs and internet search, and their beliefs about how Copilot generated answers shaped their prompting strategies. The inclusion of citations increased the perceived trustworthiness of answers without participants feeling the need to check them, and participants often reached for the same information sources as the CAI when fact-checking.
What carries the argument
The controlled user study of 20 participants performing information-retrieval and planning tasks inside a browser extension, which links beliefs about answer generation to prompting choices and citation-driven trust.
If this is right
- Including citations in AI responses can raise user acceptance of outputs even when verification does not occur.
- Users may mirror the source selection of the integrated AI during their own fact-checking.
- Pre-existing beliefs about how LLMs work directly shape how people phrase prompts to the system.
- Trust mechanisms in integrated AI rest on surface markers such as citations rather than on active source inspection.
Where Pith is reading between the lines
- Designers of other embedded AI tools may see similar citation effects if the integration hides the AI's origins as effectively as a browser extension does.
- Over-reliance on AI-listed sources could narrow the range of information people encounter when double-checking answers.
- Future interfaces might need explicit prompts or training to encourage verification outside the AI's cited set.
- The pattern could intensify as conversational features spread across more productivity applications beyond browsers.
Load-bearing premise
Observations from 20 participants in controlled tasks inside one browser extension will generalize to everyday real-world use of integrated conversational AI.
What would settle it
A follow-up observation in which users in uncontrolled settings independently verify citations or select different sources than those listed by the AI would undermine the reported trust and fact-checking pattern.
Original abstract
LLM-driven conversational AI is beginning to disappear into the background, shifting from something used directly towards something increasingly integrated into existing workflows. In the process, markers of origin and training are smoothed away as LLMs become commodified in the eyes of users. We explore how people approach using a web browser with conversational AI built in, focusing on how they develop their understanding and determine whether to trust its outputs. We conducted a study where 20 participants used the Copilot AI features in Microsoft Edge to conduct information retrieval and planning tasks. Participants relied on a combination of existing perceptions of LLMs and internet search, tracing the effect of beliefs about how Copilot generated answers on prompting strategies. The inclusion of citations increased the trustworthiness of answers without participants feeling the need to check them, with participants often reaching for the same information sources as the CAI when fact-checking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a qualitative study in which 20 participants performed information-retrieval and planning tasks inside the Microsoft Edge browser using its integrated Copilot feature. It claims that participants drew on prior beliefs about LLMs and web search to shape their prompting, that the presence of citations increased perceived trustworthiness without prompting verification, and that participants tended to consult the same external sources that Copilot itself referenced when they did check facts.
Significance. If the reported patterns hold beyond the specific experimental context, the work would supply useful empirical grounding for how trust and verification behaviors emerge when conversational AI is embedded in everyday tools rather than used as a standalone interface. The study design itself, however, supplies no evidence that the observed effects are stable properties of integrated CAI rather than artifacts of the single-extension, short-session, controlled-task setting.
major comments (2)
- [Methods / Results] The central claims about citation effects on trustworthiness and source-matching in fact-checking rest on an analysis whose protocol, coding scheme, and reliability metrics are not described. The abstract states the findings but the Methods and Results sections supply no detail on interview protocol, coding scheme, inter-rater reliability, or limitations; without these the reader cannot assess whether the reported patterns are reproducible or whether they are shaped by the particular task framing.
- [Discussion / Limitations] The generalizability concern is load-bearing: the study uses a single browser extension, 20 participants, and controlled information-retrieval/planning tasks. The manuscript does not discuss how the novelty of the tool, the short session length, or the participant pool might have produced prompting and verification habits that would not appear in everyday, multi-tool, long-term use of integrated conversational AI.
minor comments (1)
- [Introduction] The abstract and introduction use the term 'integrated conversational AI' without a precise operational definition or comparison to standalone chat interfaces; a short clarifying paragraph would help readers map the claims onto existing literature.
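The first major comment asks for inter-rater reliability figures. As a rough illustration of what such a figure could look like, the sketch below computes Cohen's kappa for two coders who labelled the same transcript excerpts. The paper does not say which reliability statistic, if any, the authors used, so the choice of kappa, the function name `cohens_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items (illustrative)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Proportion of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six excerpts; not data from the study.
coder_a = ["trust", "trust", "verify", "trust", "verify", "trust"]
coder_b = ["trust", "verify", "verify", "trust", "verify", "trust"]
kappa = cohens_kappa(coder_a, coder_b)
```

Reporting a statistic of this kind alongside the coding scheme would let readers judge how consistently the themes were applied. Reflexive thematic analysis does not always call for such a number, so a description of how coders resolved disagreements may be an acceptable alternative.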
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We find the feedback constructive and will revise the manuscript to address the concerns raised regarding methodological transparency and generalizability.
Point-by-point responses
- Referee: [Methods / Results] The central claims about citation effects on trustworthiness and source-matching in fact-checking rest on an analysis whose protocol, coding scheme, and reliability metrics are not described. The abstract states the findings but the Methods and Results sections supply no detail on interview protocol, coding scheme, inter-rater reliability, or limitations; without these the reader cannot assess whether the reported patterns are reproducible or whether they are shaped by the particular task framing.
  Authors: We agree with this assessment. The initial manuscript did not provide adequate detail on the qualitative analysis process. In the revised version, we will include a detailed description of the interview protocol (including the semi-structured questions used), the inductive thematic analysis approach, the coding scheme with examples, and any steps taken for reliability such as multiple coders reviewing transcripts. We will also explicitly discuss limitations related to task framing in a new Limitations section. This will allow readers to better evaluate the reproducibility and context of our findings.
  Revision: yes
- Referee: [Discussion / Limitations] The generalizability concern is load-bearing: the study uses a single browser extension, 20 participants, and controlled information-retrieval/planning tasks. The manuscript does not discuss how the novelty of the tool, the short session length, or the participant pool might have produced prompting and verification habits that would not appear in everyday, multi-tool, long-term use of integrated conversational AI.
  Authors: We acknowledge that the manuscript's limitations section is underdeveloped on these points. While our study is positioned as an initial exploration of integrated CAI use, we will expand the Discussion to address how the specific context (single tool, short sessions, controlled tasks, participant demographics) may influence the observed behaviors. We will discuss potential novelty effects, the difference between lab-like sessions and naturalistic long-term use, and suggest directions for future research to test generalizability across tools and over time. This revision will better frame the scope of our claims without overstating them.
  Revision: yes
Circularity Check
No circularity: empirical qualitative study with direct observations
full rationale
The paper reports results from a controlled user study with 20 participants performing information-retrieval and planning tasks in Microsoft Edge Copilot. No equations, fitted parameters, derivations, or mathematical predictions appear in the abstract or described content. Central claims rest on direct participant observations rather than any self-referential reduction, self-citation chain, or renamed known result. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Qualitative analysis of participant behavior and statements can reliably reveal mental models of AI systems.